Abstract: We develop a principled way of identifying probability
distributions whose independent and identically distributed
realizations are compressible, i.e., can be well-approximated as
sparse. We focus on Gaussian compressed sensing, an example
of underdetermined linear regression, where compressibility is
known to ensure the success of estimators exploiting sparse
regularization. We prove that many distributions revolving around
the maximum a posteriori (MAP) interpretation of sparse regularized
estimators are in fact incompressible, in the limit of large problem
sizes. We especially highlight the Laplace distribution and $\ell^1$
regularized estimators such as the Lasso and Basis Pursuit
denoising. We rigorously disprove the myth that the success of
$\ell^1$ minimization for compressed sensing image reconstruction
is a simple corollary of a Laplace model of images combined
with Bayesian MAP estimation, and show that in fact quite the
reverse is true. To establish this result, we identify non-trivial
undersampling regions where the simple least squares solution
almost surely outperforms an oracle sparse solution, when the
data is generated from the Laplace distribution. We also provide
simple rules of thumb to characterize classes of compressible
and incompressible distributions based on their second and
fourth moments. Generalized Gaussian and generalized Pareto
distributions serve as running examples.

Index Terms — compressed sensing; linear inverse problems; 
sparsity; statistical regression; Basis Pursuit; Lasso; compressible 
distribution; instance optimality; maximum a posteriori estima- 
tor; high-dimensional statistics; order statistics. 



I. Introduction 

High-dimensional data is shaping the current modus 
operandi of statistics. Surprisingly, while the ambient dimension
is large in many problems, natural constraints and parameterizations
often cause data to cluster along low-dimensional
structures. Identifying and exploiting such structures using 
probabilistic models is therefore quite important for statistical 
analysis, inference, and decision making. 

In this paper, we discuss compressible distributions, whose 
independent and identically distributed (iid) realizations can be 
well-approximated as sparse. Whether or not a distribution is 
compressible is important in the context of many applications, 
among which we highlight two here: statistics of natural 
images, and statistical regression for linear inverse problems 
such as those arising in the context of compressed sensing. 

Statistics of natural images: Acquisition, compression, de- 
noising, and analysis of natural images (similarly, medical, 
seismic, and hyperspectral images) draw high scientific and 
commercial interest. Research to date in natural image mod- 
eling has had two distinct approaches, with one focusing on 
deterministic explanations and the other pursuing probabilistic 
models. Deterministic approaches (see e.g. [10], [12]) operate 
under the assumption that the transform domain representa- 
tions (e.g., wavelets, Fourier, curvelets, etc.) of images are 
"compressible". Therefore, these approaches threshold the 
transform domain coefficients for sparse approximation, which 
can be used for compression or denoising. 

Existing probabilistic approaches also exploit coefficient 
decay in transform domain representations, and learn proba- 
bilistic models by approximating the coefficient histograms or 
moment matching. For natural images, the canonical approach 
(see e.g. [27]) is to fit probability density functions (PDFs),
such as generalized Gaussian distributions and the Gaussian 
scale mixtures, to the histograms of wavelet coefficients while 
trying to simultaneously capture the dependencies observed in 
their marginal and joint distributions. 

Statistical regression: Underdetermined linear regression is
a fundamental problem in statistics, applied mathematics, and
theoretical computer science with broad applications, from
subset selection to compressive sensing [17], [7] and inverse
problems (e.g., deblurring), and from data streaming to error
corrective coding. In each case, we seek an unknown vector
$\mathbf{x} \in \mathbb{R}^N$, given its dimensionality reducing,
linear projection $\mathbf{y} \in \mathbb{R}^m$ ($m < N$) obtained
via a known encoding matrix $\Phi \in \mathbb{R}^{m \times N}$, as

$$\mathbf{y} = \Phi\mathbf{x} + \mathbf{n}, \tag{1}$$

where $\mathbf{n} \in \mathbb{R}^m$ accounts for the perturbations
in the linear system, such as physical noise. The core challenge in
decoding $\mathbf{x}$ from $\mathbf{y}$ stems from the simple fact
that dimensionality reduction loses information in general: for any
vector $\mathbf{v} \in \mathrm{kernel}(\Phi)$, it is impossible to
distinguish $\mathbf{x}$ from $\mathbf{x} + \mathbf{v}$ based
on $\mathbf{y}$ alone.

Prior information on x is therefore necessary to estimate 
the true x among the infinitely many possible solutions. It is 
now well-known that geometric sparsity models (associated to 
approximation of x from a finite union of low-dimensional 
subspaces in $\mathbb{R}^N$ [4]) play an important role in obtaining
"good" solutions. A widely exploited decoder is the $\ell^1$ decoder
$\Delta_1(\mathbf{y}) := \arg\min_{\tilde{\mathbf{x}} : \mathbf{y} = \Phi\tilde{\mathbf{x}}} \|\tilde{\mathbf{x}}\|_1$, whose performance can be
explained via the geometry of projections of the $\ell^1$ ball in high
dimensions fTSl. A more probabilistic perspective considers x 
as drawn from a distribution. As we will see, compressible 
iid distributions [1], [9] countervail the ill-posed nature of
compressed sensing problems by generating vectors that, in 
high dimensions, are well approximated by the geometric 
sparsity model. 

A. Sparsity, compressibility and compressible distributions 

A celebrated result from compressed sensing [17], [6] is that
under certain conditions, a $k$-sparse vector $\mathbf{x}$ (with only
$k$ non-zero entries, where $k$ is usually much smaller than $N$) can be
exactly recovered from its noiseless projection $\mathbf{y}$ using the
$\ell^1$ decoder, as long as $m \gtrsim k \log(N/k)$. Possibly the most
striking result of this type is the Donoho-Tanner weak phase transition
that, for Gaussian sensing matrices, completely characterizes
the typical success or failure of the $\ell^1$ decoder in the large
scale limit [16].

Even when the vector $\mathbf{x}$ is not sparse, under certain
"compressibility" conditions typically expressed in terms of (weak)
$\ell^p$ balls, the $\ell^1$ decoder provides estimates with controlled
accuracy ifTsl . O, lfT4l . lITSl . Intuitively one should only 
expect a sparsity-seeking estimator to perform well if the 
vector being reconstructed is at least approximately sparse. 
Informally, compressible vectors can be defined as follows: 

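Definition 1 (Compressible vectors). Define the relative best
$k$-term approximation error $\bar\sigma_k(\mathbf{x})_q$ of a vector
$\mathbf{x}$ as

$$\bar\sigma_k(\mathbf{x})_q := \frac{\sigma_k(\mathbf{x})_q}{\|\mathbf{x}\|_q}, \tag{2}$$

where $\sigma_k(\mathbf{x})_q := \inf_{\|\mathbf{y}\|_0 \le k} \|\mathbf{x} - \mathbf{y}\|_q$
is the best $k$-term approximation error of $\mathbf{x}$, and
$\|\mathbf{x}\|_q$ is the $\ell^q$-norm of $\mathbf{x}$,
$q \in (0, \infty)$. By convention $\|\mathbf{x}\|_0$ counts the
non-zero coefficients of $\mathbf{x}$. A vector
$\mathbf{x} \in \mathbb{R}^N$ is $q$-compressible if
$\bar\sigma_k(\mathbf{x})_q \ll 1$ for some $k \ll N$.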
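
As a minimal sketch of Definition 1 (not taken from the paper's own code), the relative best $k$-term approximation error can be computed by sorting the magnitudes and discarding the $k$ largest. The function names and the two test vectors below are illustrative assumptions.

```python
def best_k_term_error(x, k, q):
    """sigma_k(x)_q: l^q norm of what remains after keeping the k largest magnitudes."""
    tail = sorted((abs(v) for v in x), reverse=True)[k:]
    return sum(v ** q for v in tail) ** (1.0 / q)


def relative_best_k_term_error(x, k, q):
    """sigma_k(x)_q divided by the l^q norm of x (Definition 1)."""
    norm = sum(abs(v) ** q for v in x) ** (1.0 / q)
    return best_k_term_error(x, k, q) / norm


# Hypothetical vectors: one dominated by two entries, one perfectly flat.
peaky = [10.0, -0.1, 0.2, 8.0, 0.05, -0.1]
flat = [1.0, -1.0, 1.0, 1.0, -1.0, 1.0]

print(relative_best_k_term_error(peaky, 2, 2))  # small: peaky is 2-compressible
print(relative_best_k_term_error(flat, 2, 2))   # large: flat is not
```

The flat vector illustrates the worst case for sparse approximation: every discarded entry carries as much energy as every retained one.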

This definition of compressibility differs slightly from those
that are closely linked to weak $\ell^p$ balls in that, above, we
consider relative error. This is discussed further in Section III.

When moving from the deterministic setting to the stochas- 
tic setting it is natural to ask when reconstruction guarantees 
equivalent to the deterministic ones exist. The case of typically 
sparse vectors is most easily dealt with and can be characterized
by a distribution with a probability mass of $(1 - k/N)$ at
zero, e.g., a Bernoulli-Gaussian distribution. Here the results
of Donoho and Tanner still apply as a random vector drawn
from such a distribution is typically sparse, with approximately
$k$ nonzero entries, while the $\ell^1$ decoder is blind to the specific
non-zero values of x. 



The case of compressible vectors is less straightforward: 
when is a vector generated from iid draws of a given distribution
typically compressible? This is the question investigated
in this paper. To exclude the sparse case, we restrict ourselves
to distributions with a well defined density $p(x)$.

Broadly speaking, we can define compressible distributions 
as follows. 

Definition 2 (Compressible distributions). Let $X_n$ ($n \in \mathbb{N}$)
be iid samples from a probability distribution with probability
density function (PDF) $p(x)$, and
$\mathbf{x}_N = (X_1, \ldots, X_N) \in \mathbb{R}^N$.
The PDF $p(x)$ is said to be $q$-compressible with parameters
$(\epsilon, \kappa)$ when

$$\limsup_{N \to \infty} \bar\sigma_{k_N}(\mathbf{x}_N)_q \overset{a.s.}{\le} \epsilon \quad \text{(a.s.: almost surely)}, \tag{3}$$

for any sequence $k_N$ such that $\liminf_{N \to \infty} k_N/N \ge \kappa$.

The case of interest is when $\epsilon \ll 1$ and $\kappa \ll 1$: iid
realizations of a $q$-compressible distribution with parameters
$(\epsilon, \kappa)$ live in $\epsilon$-proximity to the union of
$\kappa N$-dimensional hyperplanes, where the closeness is measured in
the $\ell^q$-norm. These hyperplanes are aligned with the coordinate
axes in $N$ dimensions.

One can similarly define an incompressible distribution as: 

Definition 3 (Incompressible distributions). Let $X_n$ and
$\mathbf{x}_N$ be defined as above. The PDF $p(x)$ is said to be
$q$-incompressible with parameters $(\epsilon, \kappa)$ when

$$\liminf_{N \to \infty} \bar\sigma_{k_N}(\mathbf{x}_N)_q \overset{a.s.}{\ge} \epsilon, \tag{4}$$

for any sequence $k_N$ such that $\limsup_{N \to \infty} k_N/N \le \kappa$.

This states that the iid realizations of an incompressible
distribution live away from the $\epsilon$-proximity of the union of
$\kappa N$-dimensional hyperplanes, where $\epsilon \approx 1$.
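
The contrast between Definitions 2 and 3 can be observed empirically. The sketch below is an illustration under assumptions that are not in the paper: it uses a symmetrized Pareto law with tail index 1.5 (infinite second moment) as the heavy-tailed example, a fixed seed, and $N = 20000$ with $\kappa = 0.1$.

```python
import random


def relative_error(x, k, q=2):
    """Relative best k-term approximation error of x in the l^q norm."""
    mags = sorted((abs(v) for v in x), reverse=True)
    total = sum(v ** q for v in mags)
    return (sum(v ** q for v in mags[k:]) / total) ** (1.0 / q)


def laplace_sample(n, rng):
    # A unit-scale Laplace variable is an Exp(1) magnitude with a random sign.
    return [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(n)]


def pareto_sample(n, rng, alpha=1.5):
    # Symmetrized Pareto; alpha < 2 gives an infinite second moment (assumed example).
    return [rng.choice((-1.0, 1.0)) * rng.paretovariate(alpha) for _ in range(n)]


rng = random.Random(0)
N, kappa = 20000, 0.1
k = int(kappa * N)
err_laplace = relative_error(laplace_sample(N, rng), k)
err_pareto = relative_error(pareto_sample(N, rng), k)
print("Laplace:", err_laplace)
print("Pareto :", err_pareto)
```

Keeping the same 10% of coefficients leaves a much larger relative error for the Laplace draw than for the heavy-tailed draw, which is the qualitative behavior the two definitions formalize.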

More formal characterizations of the "compressibility" or
the "incompressibility" of a distribution with PDF $p(x)$ are
investigated in this paper. With a special emphasis on the
context of compressed sensing with a Gaussian encoder, we
discuss and characterize the compatibility of such distributions
with extreme levels of undersampling. As a result, our work
features both positive and negative conclusions on achievable
approximation performance of probabilistic modeling in
compressed sensing.^1

B. Structure of the paper 

The main results are stated in Section II together with a
discussion of their conceptual implications. The section is
concluded by Table I, which provides an overview at a glance
of the results. The following sections discuss our contributions
in more detail, while the bulk of the technical contributions
is gathered in an appendix, to allow the main body of the paper
to concentrate on the conceptual implications of the results.
As running examples, we focus on the Laplace distribution
for incompressibility and the generalized Pareto distribution
for compressibility, with a Gaussian encoder $\Phi$.

^1 Similar ideas were recently proposed in [1]; however, while the authors
explore the stochastic concepts of compressibility they do not examine the
implications for signal reconstruction in compressed sensing type scenarios.

II. Main results

In this paper, we aim at bringing together the deterministic 
and probabilistic models of compressibility in a simple and 
general manner under the umbrella of compressible distribu- 
tions. To achieve our goal, we dovetail the concept of order 
statistics from probability theory with the deterministic models 
of compressibility from approximation theory. 

Our five "take home" messages for compressed sensing are 
as follows: 

1) $\ell^1$ minimization does not assume that the underlying
coefficients have a Laplace distribution. In fact, the
relatively flat nature of vectors drawn iid from a Laplace
distribution makes them, in some sense, the worst for
compressed sensing problems.

2) It is simply not true that the success of $\ell^1$ minimization
for compressed sensing reconstruction is a simple corollary
of a Laplace model of data coefficients combined
with Bayesian MAP estimation; in fact, quite the reverse.

3) Even with the strongest possible recovery guarantees
[13], [14], compressed sensing reconstruction of
Laplace distributed vectors with the $\ell^1$ decoder offers
no guarantees beyond the trivial estimator, $\hat{\mathbf{x}} = 0$.

4) More generally, for high-dimensional vectors $\mathbf{x}$ drawn
iid from any density with bounded fourth moment
$\mathbb{E}X^4 < \infty$, even with the help of a sparse oracle,
there is a critical level of undersampling below which the sparse
oracle estimator is worse (in relative $\ell^2$ error) than the
simple least-squares estimator.

5) In contrast, when a high-dimensional vector $\mathbf{x}$ is drawn
from a density with infinite second moment $\mathbb{E}X^2 = \infty$,
then the $\ell^1$ decoder can reconstruct $\mathbf{x}$ with arbitrarily
small relative $\ell^2$ error.

A. Relative sparse approximation error 

By using Wald's lemma on order statistics, we characterize
the relative sparse approximation errors of iid PDF
realizations, thereby providing solid mathematical ground
to the earlier work of Cevher [9] on compressible distributions.
While Cevher exploits the decay of the expected
order statistics, his approach is inconclusive in characterizing
the "incompressibility" of distributions. We close this gap
by introducing a function $G_q[p](\kappa)$ so that iid vectors as in
Definition 2 satisfy
$\lim_{N \to \infty} \bar\sigma_{k_N}(\mathbf{x}_N)_q^q \overset{a.s.}{=} G_q[p](\kappa)$
when $\lim_{N \to \infty} k_N/N = \kappa \in (0, 1)$.

Proposition 1. Suppose $\mathbf{x}_N \in \mathbb{R}^N$ is iid with
respect to $p(x)$ as in Definition 2. Denote $\bar p(x) := 0$ for
$x < 0$, and $\bar p(x) := p(x) + p(-x)$ for $x \ge 0$ as the PDF of
$|X_n|$, and $\bar F(t) := \mathbb{P}(|X| \le t)$ as its cumulative
distribution function (CDF). Assume that $\bar F$ is continuous and
strictly increasing on some interval $[a, b]$, with $\bar F(a) = 0$ and
$\bar F(b) = 1$, where $0 \le a < b \le \infty$.
For any $0 < \kappa \le 1$, define the following function:

$$G_q[p](\kappa) := \frac{\int_0^{\bar F^{-1}(1-\kappa)} x^q\, \bar p(x)\, dx}{\int_0^{\infty} x^q\, \bar p(x)\, dx}. \tag{5}$$

1) Bounded moments: assume $\mathbb{E}|X|^q < \infty$ for some
$q \in (0, \infty)$. Then, $G_q[p](\kappa)$ is also well defined for
$\kappa = 0$, and for any sequence $k_N$ such that
$\lim_{N \to \infty} k_N/N = \kappa \in [0, 1]$, the following holds
almost surely:

$$\lim_{N \to \infty} \bar\sigma_{k_N}(\mathbf{x}_N)_q^q \overset{a.s.}{=} G_q[p](\kappa). \tag{6}$$

2) Unbounded moments: assume $\mathbb{E}|X|^q = \infty$ for some
$q \in (0, \infty)$. Then, for $0 < \kappa \le 1$ and any sequence $k_N$
such that $\lim_{N \to \infty} k_N/N = \kappa$, the following holds
almost surely:

$$\lim_{N \to \infty} \bar\sigma_{k_N}(\mathbf{x}_N)_q^q \overset{a.s.}{=} G_q[p](\kappa) = 0. \tag{7}$$

Proposition 1 provides a principled way of obtaining the
compressibility parameters $(\epsilon, \kappa)$ of distributions in the
high dimensional scaling of the vectors. An immediate application
is the incompressibility of the Laplace distribution.

Example 1. As a stylized example, consider the Laplace
distribution (also known as the double exponential) with scale
parameter 1, whose PDF is given by

$$p_1(x) := \tfrac{1}{2} \exp(-|x|). \tag{8}$$

We compute in the appendix:

$$G_1[p_1](\kappa) = 1 - \kappa \cdot \left(1 + \ln 1/\kappa\right), \tag{9}$$

$$G_2[p_1](\kappa) = 1 - \kappa \cdot \left(1 + \ln 1/\kappa + \tfrac{1}{2} (\ln 1/\kappa)^2\right). \tag{10}$$

Therefore, it is straightforward to see that the Laplace distribution
is not $q$-compressible for $q \in \{1, 2\}$: it is not possible
to simultaneously have both $\kappa$ and $\epsilon = G_q[p_1](\kappa)$ small.

B. Sparse modeling vs. sparsity promotion

We show that the maximum a posteriori (MAP) interpretation
of standard deterministic sparse recovery algorithms is,
in some sense, inconsistent. To explain why, we consider the
following decoding approaches to estimate a vector $\mathbf{x}$ from its
encoding $\mathbf{y} = \Phi\mathbf{x}$:

$$\Delta_1(\mathbf{y}) = \arg\min_{\tilde{\mathbf{x}} : \mathbf{y} = \Phi\tilde{\mathbf{x}}} \|\tilde{\mathbf{x}}\|_1, \tag{11}$$

$$\Delta_{LS}(\mathbf{y}) = \arg\min_{\tilde{\mathbf{x}} : \mathbf{y} = \Phi\tilde{\mathbf{x}}} \|\tilde{\mathbf{x}}\|_2 = \Phi^{+}\mathbf{y}, \tag{12}$$

$$\Delta_{oracle}(\mathbf{y}, \Lambda) = \arg\min_{\tilde{\mathbf{x}} : \mathrm{support}(\tilde{\mathbf{x}}) = \Lambda} \|\mathbf{y} - \Phi\tilde{\mathbf{x}}\|_2 = \Phi_\Lambda^{+}\mathbf{y}, \tag{13}$$

$$\Delta_{trivial}(\mathbf{y}) = 0. \tag{14}$$

Here, $\Phi_\Lambda$ denotes the sub-matrix of $\Phi$ restricted to the
columns indexed by the set $\Lambda$. The decoder $\Delta_1$ regularizes
the solution space via the $\ell^1$-norm. It is the de facto standard Basis
Pursuit formulation [11] for sparse recovery, and is tightly
related to the Basis Pursuit denoising (BPDN) and the least
absolute shrinkage and selection operator (LASSO) [31]:

$$\Delta_{BPDN}(\mathbf{y}) = \arg\min_{\tilde{\mathbf{x}}} \tfrac{1}{2}\|\mathbf{y} - \Phi\tilde{\mathbf{x}}\|_2^2 + \lambda \|\tilde{\mathbf{x}}\|_1,$$


where $\lambda$ is a constant. Both $\Delta_1$ and the BPDN formulations
can be solved in polynomial time through convex optimization
techniques. The decoder $\Delta_{LS}$ is the traditional minimum
least-squares solution, which is related to the Tikhonov regularization
or ridge regression. It uses the Moore-Penrose pseudo-inverse
$\Phi^{+} = \Phi^T(\Phi\Phi^T)^{-1}$. The oracle sparse decoder
$\Delta_{oracle}$ can be seen as an idealization of sparse decoders, which
combine subset selection (the choice of $\Lambda$) with a form of
linear regression. It is an "informed" decoder that has the
side information of the index set $\Lambda$ associated with the largest
components in $\mathbf{x}$. The trivial decoder $\Delta_{trivial}$ plays
the devil's advocate for the performance guarantees of the other decoders.

1) Almost sure performance of decoders: When the encoder
$\Phi$ provides near isometry to the set of sparse vectors [6], the
decoder $\Delta_1$ features an instance optimality property [13], [14]:

$$\|\Delta_1(\Phi\mathbf{x}) - \mathbf{x}\|_1 \le C_k(\Phi) \cdot \sigma_k(\mathbf{x})_1, \quad \forall \mathbf{x}; \tag{15}$$

where $C_k(\Phi)$ is a constant which depends on $\Phi$. A similar
result holds with the $\|\cdot\|_2$ norm on the left hand side.
Unfortunately, it is impossible to have the same uniform guarantee
for all $\mathbf{x}$ with $\sigma_k(\mathbf{x})_2$ on the right hand side
[13], but for any given $\mathbf{x}$, it becomes possible in probability
[13], [15]. For a Gaussian encoder, $\Delta_1$ recovers exact sparse
vectors perfectly from as few as $m \approx 2ek \log N/k$ measurements
with high probability [16].

Definition 4 (Gaussian encoder). Let $\phi_{ij}$, $i, j \in \mathbb{N}$
be iid Gaussian variables $\mathcal{N}(0, 1)$. The $m \times N$ Gaussian
encoder is the random matrix
$\Phi_N := [\phi_{ij}/\sqrt{m}]_{1 \le i \le m,\, 1 \le j \le N}$.

In the sequel, we only consider the Gaussian encoder,
leading to Gaussian compressed sensing (G-CS) problems.
In Section IV we theoretically characterize the almost sure
performance of the estimators $\Delta_{LS}$, $\Delta_{oracle}$ for
arbitrary high-dimensional vectors $\mathbf{x}$. We concentrate our
analysis on the noiseless setting ($\mathbf{n} = 0$).^2 The least squares
decoder $\Delta_{LS}$ has expected performance
$\mathbb{E}_\Phi \|\Delta_{LS}(\Phi\mathbf{x}) - \mathbf{x}\|_2^2 / \|\mathbf{x}\|_2^2 = 1 - \delta$,
independent of the vector $\mathbf{x}$, where

$$\delta := m/N \tag{16}$$

is the undersampling ratio associated to the matrix $\Phi$ (this
terminology comes from compressive sensing, where $\Phi$ is a
sampling matrix). In Theorem 3, the expected performance of
the oracle sparse decoder $\Delta_{oracle}$ is shown to satisfy

$$\frac{\mathbb{E}_\Phi \|\Delta_{oracle}(\Phi\mathbf{x}, \Lambda) - \mathbf{x}\|_2^2}{\|\mathbf{x}\|_2^2} = \frac{1}{1 - \frac{k}{m-1}} \times \frac{\sigma_k(\mathbf{x})_2^2}{\|\mathbf{x}\|_2^2}.$$

This error is the balance between two factors. The first factor
grows with $k$ (the size of the set $\Lambda$ of largest entries of
$\mathbf{x}$ used in the decoder) and reflects the (ill-)conditioning
of the Gaussian submatrix $\Phi_\Lambda$. The second factor is the
best $k$-term relative approximation error, which shrinks as $k$
increases. This highlights the inherent trade-off present in any
sparse estimator, namely the level of sparsity $k$ versus the
conditioning of the sub-matrices of $\Phi$.
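
The trade-off between the two factors can be traced numerically without forming any matrix, by evaluating the expected oracle error formula above for every admissible $k$. This is only a sketch: the sample size, the seed, and the choice $\delta = 0.05$ are assumptions made for illustration.

```python
import random


def oracle_error_curve(x, m):
    """Expected relative squared l^2 error of the oracle decoder for k = 0, ..., m - 2.

    Evaluates (1 / (1 - k / (m - 1))) * sigma_k(x)_2^2 / ||x||_2^2 for each k.
    """
    energy = sorted((v * v for v in x), reverse=True)
    total = sum(energy)
    curve, head = [], 0.0
    for k in range(m - 1):  # k < m - 1 keeps the conditioning factor finite
        rel_tail = (total - head) / total
        curve.append(rel_tail / (1.0 - k / (m - 1.0)))
        head += energy[k]  # energy captured by the k + 1 largest entries
    return curve


rng = random.Random(1)
N, m = 20000, 1000  # undersampling ratio delta = 0.05
x = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(N)]  # iid Laplace

curve = oracle_error_curve(x, m)
best_k = min(range(len(curve)), key=curve.__getitem__)
print("best k:", best_k, "oracle error:", curve[best_k])
print("least-squares error 1 - delta:", 1.0 - m / N)
```

For this Laplace draw, no choice of $k$ brings the oracle below the least-squares level $1 - \delta$: shrinking the approximation error costs more in conditioning than it gains.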

2) A few surprises regarding sparse recovery guarantees: 
We highlight two counter-intuitive results below: 

^2 Coping with noise in such problems is important both from a practical
and a statistical perspective. Yet, the noiseless setting is relevant to establish
negative results such as Theorem 1, which shows the failure of sparse
estimators in the absence of noise, for an 'undersampling ratio' $\delta$ bounded
away from zero. Straightforward extensions of more positive results such as
Theorem 2 to the Gaussian noise setting can be envisioned.



a) A crucial weakness in appealing to instance optimality:
Although instance optimality (15) is usually considered as
a strong property, it involves an implicit trade off: when $k$ is
small, the $k$-term error $\sigma_k(\mathbf{x})$ is large, while for
larger $k$, the constant $C_k(\Phi)$ is large. For instance, we have
$C_k(\Phi) = \infty$ when $k > m$.

In Section III we provide new key insights for instance
optimality of algorithms. Informally, we show that when
$\mathbf{x}_N \in \mathbb{R}^N$ is iid with respect to $p(x)$ as in
Definition 2, and when $p(x)$ satisfies the hypotheses of Proposition 1, if

$$G_1[p](\kappa_0) \ge 1/2, \tag{17}$$

where $\kappa_0 \approx 0.18$ is an absolute constant, then the best possible
upper bound in the instance optimality (15) for a Gaussian
encoder satisfies (in the limit of large $N$)

$$C_k(\Phi) \cdot \sigma_k(\mathbf{x})_1 \ge \|\mathbf{x}\|_1 = \|\Delta_{trivial}(\mathbf{x}) - \mathbf{x}\|_1.$$

In other words, for distributions with PDF $p(x)$ satisfying (17),
in high dimension $N$, instance optimality results for the
decoder $\Delta_1$ with a Gaussian encoder can at best guarantee
the performance (in the $\ell^1$ norm) of the trivial decoder
$\Delta_{trivial}$. Condition (17) holds true for many general PDFs; it is
easily verifiable for the Laplace distribution based on Example 1,
and explains the observed failure of the $\ell^1$ decoder on Laplace
data [29]. This is discussed further in Section III.
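
For the Laplace distribution, condition (17) can be checked directly from the closed form (9), which gives $G_1[p_1](0.18) = 1 - 0.18\,(1 + \ln(1/0.18))$, roughly 0.51. The sketch below evaluates the closed forms of Example 1 and compares (9) with a Monte Carlo estimate; the sample size and seed are illustrative assumptions.

```python
import math
import random


def G1_laplace(kappa):
    """Closed form (9) for the unit-scale Laplace distribution."""
    return 1.0 - kappa * (1.0 + math.log(1.0 / kappa))


def G2_laplace(kappa):
    """Closed form (10) for the unit-scale Laplace distribution."""
    t = math.log(1.0 / kappa)
    return 1.0 - kappa * (1.0 + t + 0.5 * t * t)


def empirical_G(q, kappa, n, rng):
    """Monte Carlo estimate of the limit in (6): relative energy outside the top kappa * n."""
    # |X| is Exp(1) when X is unit-scale Laplace; the signs play no role here.
    mags = sorted((rng.expovariate(1.0) for _ in range(n)), reverse=True)
    k = int(kappa * n)
    return sum(v ** q for v in mags[k:]) / sum(v ** q for v in mags)


kappa0 = 0.18
print("closed form G1:", G1_laplace(kappa0))
print("Monte Carlo G1:", empirical_G(1, kappa0, 200000, random.Random(0)))
print("condition (17) holds:", G1_laplace(kappa0) >= 0.5)
```

The margin above 1/2 is small, so the check is sensitive to the value of $\kappa_0$; the constant 0.18 is taken from the text as given.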

b) Fundamental limits of sparsity promoting decoders:
The expected $\ell^2$ relative error of the least-squares estimator
$\Delta_{LS}$ degrades linearly as $1 - \delta$ with the undersampling factor
$\delta := m/N$, and therefore does not provide good reconstruction
at low sampling rates $\delta \ll 1$. It is therefore quite surprising
that we can determine a large class of distributions for which
the oracle sparse decoder $\Delta_{oracle}$ is outperformed by the simple
least-squares decoder $\Delta_{LS}$.

Theorem 1. Suppose that $\mathbf{x}_N \in \mathbb{R}^N$ is iid with
respect to $p(x)$ as in Definition 2, and that $p(x)$ satisfies the
hypotheses of Proposition 1 and has a finite fourth moment

$$\mathbb{E}X^4 < \infty.$$

There exists a minimum undersampling ratio $\delta_0$ with the
following property: for any $\rho \in (0, 1)$, if $\Phi_N$ is a sequence of
$m_N \times N$ Gaussian encoders with
$\lim_{N \to \infty} m_N/N = \delta < \delta_0$,
and $\lim_{N \to \infty} k_N/m_N = \rho$, then we have almost surely

$$\lim_{N \to \infty} \frac{\|\Delta_{oracle}(\Phi_N\mathbf{x}_N, \Lambda_N) - \mathbf{x}_N\|_2^2}{\|\mathbf{x}_N\|_2^2} \overset{a.s.}{=} \frac{G_2[p](\rho\delta)}{1 - \rho} > 1 - \delta \overset{a.s.}{=} \lim_{N \to \infty} \frac{\|\Delta_{LS}(\Phi_N\mathbf{x}_N) - \mathbf{x}_N\|_2^2}{\|\mathbf{x}_N\|_2^2}.$$

Thus if the data PDF $p(x)$ has a finite fourth moment and a
continuous CDF, there exists a level of undersampling below
which a simple least-squares reconstruction (typically a dense
vector estimate) provides an estimate which is closer to the
true vector $\mathbf{x}$ (in the $\ell^2$ sense) than oracle sparse estimation!

Section V describes how to determine this undersampling
boundary, e.g., for the generalized Gaussian distribution. For
the Laplace distribution, $\delta_0 \approx 0.15$. In other words, when
randomly sampling a high-dimensional Laplace vector, it is
better to use least-squares reconstruction than minimum $\ell^1$
norm reconstruction (or any other type of sparse estimator),
unless the number of measurements $m$ is at least 15% of the original
vector dimension $N$. To see how well Theorem 1 is grounded
in practice, we provide the following example:

Example 2. Figure 1 examines in more detail the performance
of the estimators for Laplace distributed data at various
undersampling values. The horizontal lines indicate various
signal-to-distortion-ratios (SDR) of 3 dB, 10 dB and 20 dB. Thus
for the oracle estimator to achieve 10 dB, the undersampling
rate must be greater than 0.7, while to achieve a performance
level of 20 dB, something that might reasonably be expected in
many compressed sensing applications, we can hardly afford
any subsampling at all since this requires $\delta > 0.9$.
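
The Laplace threshold can be approximated from the limit in Theorem 1 together with the closed form (10): for each $\delta$, minimize $G_2[p_1](\rho\delta)/(1-\rho)$ over $\rho$ and compare with $1 - \delta$. The following is a rough grid-search sketch, not the paper's procedure; the grid resolutions and the SDR convention (minus ten times the base-10 logarithm of the relative squared error) are assumptions.

```python
import math


def G2_laplace(kappa):
    """Closed form (10) for the unit-scale Laplace distribution."""
    t = math.log(1.0 / kappa)
    return 1.0 - kappa * (1.0 + t + 0.5 * t * t)


def best_oracle_error(delta, steps=2000):
    """Grid minimum over rho in (0, 1) of G2(rho * delta) / (1 - rho)."""
    return min(
        G2_laplace(delta * r / steps) / (1.0 - r / steps) for r in range(1, steps)
    )


def critical_delta(step=0.001):
    """Smallest grid value of delta at which the best oracle beats least squares."""
    d = step
    while d < 1.0:
        if best_oracle_error(d) < 1.0 - d:
            return d
        d += step
    return None


def sdr_db(relative_squared_error):
    # Assumed convention: SDR in dB from the relative squared l^2 error.
    return -10.0 * math.log10(relative_squared_error)


print("estimated delta_0:", critical_delta())
for delta in (0.7, 0.9):
    print("delta =", delta, "best oracle SDR (dB):", sdr_db(best_oracle_error(delta)))
```

Under these assumptions the estimate lands close to the value of about 0.15 quoted above, and the oracle SDR at $\delta = 0.7$ and $\delta = 0.9$ sits near the 10 dB and 20 dB levels discussed in Example 2.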

[Figure 1: "Relative Error for Laplace Distributed Data"; horizontal axis: undersampling rate $\delta$.]

Fig. 1. The expected relative error as a function of the undersampling rates
$\delta$ for data iid from a Laplace distribution using: (a) a linear least squares
estimator (solid) and (b) the best oracle sparse estimator (dashed). Also plotted
is the empirically observed average relative error over 5000 instances for the
$\Delta_1$ estimator (dotted). The horizontal lines indicate SDR values of 3 dB, 10 dB
and 20 dB, as marked.



This may come as a shock since, in Bayesian terminology, 
$\ell^1$-norm minimization is often conventionally interpreted as
the MAP estimator under the Laplace prior, while least squares 
is the MAP under the Gaussian prior. Such MAP interpreta- 
tions of compressed sensing decoders are further discussed 
below and contrasted to more geometric interpretations. 

C. Pitfalls of MAP "interpretations" of decoders 

Bayesian compressed sensing methods employ probability 
measures as "priors" in the space of the unknown vector x, and 
arbitrate the solution space by using the chosen measure. The 
decoder $\Delta_1$ has a distinct probabilistic interpretation in the
statistics literature. If we presume an iid probabilistic model
for $\mathbf{x}$ as $p(X_n) \propto \exp(-c|X_n|)$ ($n = 1, \ldots, N$),
then $\Delta_{BPDN}$ can be viewed as the MAP estimator

$$\Delta_{MAP}(\mathbf{y}) := \arg\max_{\mathbf{x}} p(\mathbf{x}|\mathbf{y}) = \arg\min_{\mathbf{x}} \{-\log p(\mathbf{x}|\mathbf{y})\},$$

when the noise $\mathbf{n}$ is iid Gaussian, which becomes the $\Delta_1$
decoder in the zero noise limit. However, as illustrated by
Example 2, the decoder $\Delta_{MAP}$ performs quite poorly for iid
Laplace vectors. The possible inconsistency of MAP estimators
is a known phenomenon [26]. Yet, the fact that $\Delta_{MAP}$ is
outperformed by $\Delta_{LS}$ (which is the MAP under the Gaussian
prior) when $\mathbf{x}$ is drawn iid according to the Laplacian
distribution should remain somewhat counterintuitive to many
readers.
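
The correspondence between the MAP estimator and BPDN can be made explicit in one line. The derivation below is a sketch under an assumption not stated in the text: the Gaussian noise has iid entries of variance $\sigma^2$. The prior constant $c$ is the one introduced above.

```latex
% Assumption (not in the source): n has iid N(0, sigma^2) entries.
\begin{align*}
-\log p(\mathbf{x} \mid \mathbf{y})
  &= -\log p(\mathbf{y} \mid \mathbf{x}) - \log p(\mathbf{x}) + \text{const} \\
  &= \frac{1}{2\sigma^2}\,\|\mathbf{y} - \Phi\mathbf{x}\|_2^2
     + c\,\|\mathbf{x}\|_1 + \text{const}.
\end{align*}
% Multiplying by sigma^2 gives the BPDN objective with lambda = c * sigma^2;
% letting sigma -> 0 enforces y = Phi x and leaves the minimization of ||x||_1.
```

With this reading, the BPDN regularization weight is tied to the noise level and the prior scale, which is exactly the link that the Laplace counterexample calls into question as a generative model.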

It is now not uncommon to stumble upon new proposals
in the literature for the modification of $\Delta_1$ or BPDN with
diverse thresholding or re-weighting rules based on different
hierarchical probabilistic models, many of which correspond
to a special Bayesian "sparsity prior"
$p(\mathbf{x}) \propto \exp(-\phi(\mathbf{x}))$ [12], associated to the
minimization of new cost functions

$$\Delta_\phi(\mathbf{y}) := \arg\min_{\tilde{\mathbf{x}}} \tfrac{1}{2}\|\mathbf{y} - \Phi\tilde{\mathbf{x}}\|_2^2 + \phi(\tilde{\mathbf{x}}).$$

It has been shown in the context of additive white Gaussian
noise denoising that the MAP interpretation of such penalized
least-squares regression can be misleading [20]. Just as
illustrated above with $\phi(\mathbf{x}) = \lambda\|\mathbf{x}\|_1$, while the
geometric interpretations of the cost functions associated to such
"priors" are useful for sparse recovery, the "priors"
$\exp(-\phi(\mathbf{x}))$ themselves do not necessarily constitute a
relevant "generative model" for the vectors. Hence, such proposals
are losing a key strength of the Bayesian approach: the ability to
evaluate the "goodness" or "confidence" of the estimates due to the
probabilistic model itself or its conjugate prior mechanics.

In fact, the empirical success of $\Delta_1$ (or $\Delta_{BPDN}$) results from
a combination of two properties:

1) the sparsity-inducing nature of the cost function, due to
the non-differentiability at zero of the $\ell^1$ cost function;

2) the compressible nature of the vector $\mathbf{x}$ to be estimated.

Geometrically speaking, the objective $\|\mathbf{x}\|_1$ is related to the
$\ell^1$-ball, which intersects with the constraints (e.g., a randomly
oriented hyperplane, as defined by $\mathbf{y} = \Phi\mathbf{x}$) along or
near the $k$-dimensional hyperplanes ($k \ll N$) that are aligned with the
canonical coordinate axes in $\mathbb{R}^N$. The geometric interplay of
the objective and the constraints in high dimensions inherently
promotes sparsity. An important practical consequence is the
ability to design efficient optimization algorithms for large-scale
problems, using thresholding operations. Therefore, the
decoding process of $\Delta_1$ automatically sifts smaller subsets
that best explain the observations, unlike the traditional
least-squares $\Delta_{LS}$.

When $\mathbf{x}_N$ has iid coordinates as in Definition 2,
compressibility is not so much related to the behavior (differentiable
or not) of $p(x)$ around zero but rather to the thickness of
its tails, e.g., through the necessary property $\mathbb{E}X^4 = \infty$
(cf. Theorem 1). We further show that distributions with infinite
variance ($\mathbb{E}X^2 = \infty$) almost surely generate vectors which
are sufficiently compressible to guarantee that the decoder
$\Delta_1$ with a Gaussian encoder $\Phi$ of arbitrary (fixed) small
sampling ratio $\delta = m/N$ has ideal performance in dimensions
$N$ growing to infinity:

Theorem 2 (Asymptotic performance of the $\ell^1$ decoder under
infinite second moment). Suppose that $\mathbf{x}_N \in \mathbb{R}^N$ is
iid with respect to $p(x)$ as in Definition 2, and that $p(x)$ satisfies the
hypotheses of Proposition 1 and has infinite second moment
$\mathbb{E}X^2 = \infty$. Consider a sequence of integers $m_N$ such that
$\lim_{N \to \infty} m_N/N = \delta$ where $0 < \delta < 1$ is arbitrary, and
let $\Phi_N$ be a sequence of $m_N \times N$ Gaussian encoders. Then

$$\lim_{N \to \infty} \frac{\|\Delta_1(\Phi_N\mathbf{x}_N) - \mathbf{x}_N\|_2}{\|\mathbf{x}_N\|_2} \overset{a.s.}{=} 0. \tag{18}$$

As shown in Section VI, there exist PDFs $p(x)$, which
combine heavy tails with a non-smooth behavior at zero, such
that the associated MAP estimator is sparsity promoting. It is
likely that the MAP with such priors can be shown to perform
ideally well in the asymptotic regime.

D. Are natural images compressible or incompressible ? 

Theorems 1 and 2 provide easy to check conditions for
(in)compressibility of a PDF $p(x)$ based on its second or fourth
moments. These rules of thumb are summarized in Table I,
providing an overview at a glance of the main results obtained
in this paper.

We conclude this extended overview of the results with a
stylized application of these rules of thumb to wavelet and
discrete cosine transform (DCT) coefficients of the natural images
from the Berkeley database [24]. Our results below provide
an approximation theoretic perspective to the probabilistic
modeling approaches in the natural scene statistics community

ED, Eol, 1^. 

Figure 2 illustrates, in log-log scale, the average of the
magnitude ordered wavelet coefficients (Figures 2(a)-(c)), and
of the DCT coefficients (Figure 2(b)). They are obtained
by randomly sampling 100 image patches of varying sizes
$N = 2^j \times 2^j$ ($j = 3, \ldots, 8$), and taking their transforms
(scaling filter for wavelets: Daubechies 4). For comparison,
we also plot the expected order statistics (dashed lines), as
described in [9], of the following distributions (cf. Sections V-B
and VI):

• GPD: the scaled generalized Pareto distribution with
density $\frac{1}{\lambda} p_{\tau,s}(x/\lambda)$, $\tau = 1$, with
parameters $s = 2.69$ and $\lambda = 8$ (Figure 2(a));

• Student's t: the scaled Student's t distribution with
density $\frac{1}{\lambda} p_{\tau,s}(x/\lambda)$, $\tau = 2$, with
parameters $s = 2.64$ and $\lambda = 4.5$ (Figure 2(b));

• GGD: the scaled generalized Gaussian distribution with
density $\frac{1}{\lambda} p_{\tau}(x/\lambda)$, with $\tau = 0.7$ and
$\lambda = 5$ (Figure 2(c)).

The GGD parameters were obtained by approximating the
histogram of the wavelet coefficients at $N = 8 \times 8$, as is
the common practice in the signal processing community [10].
The GPD and Student's t parameters were tuned manually.

One should note that image transform coefficients are 
certainly not iid [29], for instance: nearby wavelets have
correlated coefficients; wavelet coding schemes exploit well- 
known zero-trees indicating correlation across scales; the 
energy across wavelet scales often follows a power law decay. 

The empirical goodness-of-fits in Figure 2 (a), (b) seem
to indicate that the distribution of the coefficients of natural
images, marginalized across all scales (in wavelets) or
frequencies (DCT), can be well approximated by a distribution
of the type $p_{\tau,s}$ (cf. Table I) with "compressibility parameter"
$s \approx 2.67 < 3$. For this regime the results of [9] were
inconclusive regarding compressibility. However, from Table I we
see that such a distribution satisfies $\mathbb{E}X^2 = \infty$ (cf.
Example 4 in Section VI), and therefore we are able to conclude that
in the limit of very high resolutions $N \to \infty$, such images
are sufficiently compressible to be acquired using compressive
sampling with both arbitrarily good relative precision and
arbitrarily small undersampling factor $\delta = m/N \ll 1$.
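
The moment-based rules of thumb can be phrased in terms of a tail exponent. The sketch below assumes, as a simplification not spelled out in this section, that the density decays like $|x|^{-s}$ for large $|x|$, so that $\mathbb{E}|X|^q$ is finite exactly when $q < s - 1$. The function name and return strings are illustrative.

```python
def classify_by_tail_exponent(s):
    """Apply the second/fourth moment rules to a density decaying like |x|**(-s).

    Assumption: E|X|^q is finite if and only if q < s - 1.
    """
    if s <= 1.0:
        raise ValueError("a density with tail exponent s <= 1 is not integrable")
    if s <= 3.0:  # E X^2 is infinite
        return "compressible (Theorem 2 applies)"
    if s > 5.0:   # E X^4 is finite
        return "sparse oracle can lose to least squares (Theorem 1 applies)"
    return "undetermined by the moment rules"


for s in (2.67, 4.0, 6.0):
    print(s, "->", classify_by_tail_exponent(s))
```

The value $s \approx 2.67$ reported for natural images falls in the first regime, matching the conclusion drawn in the text.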

Considering the GGD with parameter $\tau = 0.7$, the results
of Section V-B (cf. Figure 6) indicate that it is associated
to a critical undersampling ratio $\delta_0(0.7) \approx 0.04$. Below this
undersampling ratio, the oracle sparse decoder is outperformed
by the least squares decoder, which has the very poor expected
relative error $1 - \delta \ge 0.96$. Should the GGD be an accurate
model for coefficients of natural images, this would imply
that compressive sensing of natural images requires a number
of measurements at least 4% of the target number of image pixels.
However, while the generalized Gaussian approximation of the
coefficients appears quite accurate at $N = 8 \times 8$, the empirical
goodness-of-fits quickly deteriorate at higher resolution.
For instance, the initial decay rate of the GGD coefficients
varies with the dimension. Surprisingly, the GGD coefficients
approximate the small coefficients (i.e., the histogram) rather
well irrespective of the dimension. This phenomenon could be
deceiving while predicting the compressibility of the images.

III. Instance optimality, $\ell^r$-balls and compressibility in G-CS

Well-known results indicate that for certain matrices, $\Phi$, and for certain types of sparse estimators of $x$, such as the minimum $\ell^1$ norm solution, $\Delta_1(y)$, an instance optimality property holds [13]. In the simplest case of noiseless observations, this reads: the pair $\{\Phi, \Delta\}$ is instance optimal to order $k$ in the $\ell^q$ norm with constant $C_k$ if for all $x$:

$$\|\Delta(\Phi x) - x\|_q \le C_k \cdot \sigma_k(x)_q, \tag{19}$$

where $\sigma_k(x)_q$ is the error of best approximation of $x$ with $k$-sparse vectors, while $C_k$ is a constant which depends on $k$. Various flavors of instance optimality are possible [6], [13]. We will initially focus on $\ell^1$ instance optimality. For the $\ell^1$ estimator (11) it is known that instance optimality in the $\ell^1$ norm (i.e. $q = 1$ in (19)) is related to the following robust null space property. The matrix $\Phi$ satisfies the robust null space property of order $k$ with constant $\eta \le 1$ if:

$$\|z_\Omega\|_1 \le \eta \|z_{\bar\Omega}\|_1 \tag{20}$$

for all nonzero $z$ belonging to the null space $\mathrm{kernel}(\Phi) := \{z, \Phi z = 0\}$ and all index sets $\Omega$ of size $k$, where the notation $z_\Omega$ stands for the vector matching $z$ for indices in $\Omega$ and zero elsewhere. It has further been shown [14], [33] that the robust null space property of order $k$ with constant $\eta_k$ is a necessary and sufficient condition for $\ell^1$-instance optimality with the constant $C_k$ given by:

$$C_k = 2\,\frac{1 + \eta_k}{1 - \eta_k}. \tag{21}$$

TABLE I
Summary of the main results

| Moment property | $\mathbb{E}X^2 = \infty$ | $\mathbb{E}X^2 < \infty$ and $\mathbb{E}X^4 = \infty$ | $\mathbb{E}X^4 < \infty$ |
| General result | Theorem 2: $\Delta_1$ performs ideally for any $\delta$ | N/A: depends on finer properties of $p(x)$ | $\Delta_{LS}$ outperforms $\Delta_{oracle}$ for small $\delta < \delta_0$ |
| Compressible | YES | YES or NO | NO |
| Examples | | Proposition 2 (Section V-A): $p_0(x) := 2|x|/(x^2+1)^3$; $\Delta_{oracle}$ performs just as $\Delta_{LS}$ | Section V-B: $p_\tau(x) \propto \exp(-|x|^\tau)$, $0 < \tau < \infty$ (generalized Gaussian) |
| Example 4 (Section VI): $p_{\tau,s}(x) \propto (1 + |x|^\tau)^{-s/\tau}$, generalized Pareto ($\tau = 1$) / Student's t ($\tau = 2$) | Case $1 < s \le 3$ | Case $3 < s < 5$: $\Delta_{oracle}$ outperforms $\Delta_{LS}$ for small $\delta < \delta_0$ | Case $s > 5$ |

Fig. 2. Panels: (a) Wavelet/GPD, (b) DCT/Student's t distribution, (c) Wavelet/GGD. Solid lines illustrate the Wavelet or DCT transform domain average order statistics of image patches from the Berkeley database [24]. Dashed lines show the theoretical expected order statistics of the GPD, Student's t, and the GGD distributions with the indicated parameter values. The resolution of image patch sizes varies from left to right as $\{(8 \times 8), (16 \times 16), \ldots, (256 \times 256)\}$, respectively.



Instance optimality is commonly considered as a strong property, since it controls the absolute error in terms of the "compressibility" of $x$, expressed through $\sigma_k(x)$. For instance optimality to be meaningful we therefore require that $\sigma_k(x)$ be small in some sense. This idea has been encapsulated in a deterministic notion of compressible vectors [13]. Specifically, suppose that $x$ lies in the $\ell^r$ ball of radius $R$ or the weak $\ell^r$ ball of radius $R$ defined as:

$$\|x\|_{w\ell^r} := \sup_n |x|_n^* \cdot n^{1/r} \le R, \tag{22}$$

with $|x|_n^*$ the $n$-th largest absolute value of elements of $x$ (Figure 3(a) illustrates the relationship between the weak $\ell^r$ ball and the $\ell^r$ ball of the same radius). Then we can bound $\sigma_k(x)_q$, for $q > r$, by

$$\sigma_k(x)_q \le R \left(\frac{r}{q-r}\right)^{1/q} k^{-(1/r - 1/q)}, \tag{23}$$

therefore guaranteeing that the $k$-term approximation error is vanishingly small for large enough $k$.
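The bound (23) can be checked numerically on a vector that sits on the boundary of the weak $\ell^r$ ball. The following is a minimal sketch, not part of the original analysis: the test vector $|x|_n^* = R\,n^{-1/r}$, the parameter values and the helper names `sigma_k` and `weak_ball_bound` are all illustrative assumptions.

```python
import math

def sigma_k(x, k, q):
    """l^q norm of what remains after removing the k largest-magnitude entries."""
    tail = sorted((abs(v) for v in x), reverse=True)[k:]
    return sum(v ** q for v in tail) ** (1.0 / q)

def weak_ball_bound(R, r, q, k):
    """Right-hand side of (23); only meaningful for q > r."""
    return R * (r / (q - r)) ** (1.0 / q) * k ** (-(1.0 / r - 1.0 / q))

# Hypothetical extremal vector of the weak l^r ball: n-th largest entry R * n^(-1/r).
R, r, q, N = 1.0, 0.5, 2.0, 10_000
x = [R * n ** (-1.0 / r) for n in range(1, N + 1)]

for k in (1, 10, 100, 1000):
    # The k-term error should stay below the bound and decay with k.
    print(k, sigma_k(x, k, q), weak_ball_bound(R, r, q, k))
```

The decay exponent $1/r - 1/q$ is what makes small $r$ attractive: the smaller $r$ is, the faster the best $k$-term error falls.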

Such models cannot be directly applied to the stochastic framework since, as noted in [1], iid realizations do not belong to any weak $\ell^r$ ball. One obvious way to resolve this is to normalize the stochastic vector. If $\mathbb{E}|X|^r = C < \infty$ then by the strong law of large numbers,

$$\|x_N\|_r^r / N \overset{a.s.}{\to} C. \tag{24}$$




Fig. 3. (a) A cartoon view of an $\ell^r$ ball (white) and the weak $\ell^r$ ball of the same radius (grey); (b) a cartoon view of the notion of the compressible rays model.



For example, such a signal model is considered in ifTSi for 
the G-CS problem, where precise bounds on the worst-case asymptotic minimax mean-squared reconstruction error are calculated for $\ell^1$-based decoders.

It can be tempting to assert that a vector drawn from a probability distribution satisfying (24) is "compressible." Unfortunately, this is a poor definition of a compressible distribution because finite dimensional $\ell^r$ balls also contain 'flat' vectors with entries of similar magnitude, which have very small $k$-term approximation error only because the vectors are very small themselves.






TO APPEAR IN IEEE TRANSACTIONS ON INFORMATION THEORY, 2012 



For example, if $x_N$ has entries drawn from the Laplace distribution then $x_N/N$ will, with high probability, have an $\ell^1$-norm close to 1. However the Laplace distribution also has a finite second moment $\mathbb{E}X^2 = 2$, hence, with high probability $x_N/N$ has $\ell^2$-norm close to $\sqrt{2/N}$. This is not far from the $\ell^2$ norm of the largest flat vectors that live in the unit $\ell^1$ ball, which have the form $|x|_n = 1/N$, $1 \le n \le N$. Hence a typical iid Laplace distributed vector is a small and relatively flat vector. This is illustrated in Figure 4.
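The two norms quoted above are easy to reproduce by simulation. The sketch below is only illustrative: it assumes the unit-scale Laplace density $p_1(x) = \tfrac12 e^{-|x|}$ (so that $\mathbb{E}|X| = 1$ and $\mathbb{E}X^2 = 2$), and the sample size and seed are arbitrary choices.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible
N = 100_000

# Unit-scale Laplace sample: exponential magnitude with a random sign.
x = [random.expovariate(1.0) * random.choice((-1.0, 1.0)) for _ in range(N)]
x_scaled = [v / N for v in x]

l1 = sum(abs(v) for v in x_scaled)            # expected to be close to 1
l2 = math.sqrt(sum(v * v for v in x_scaled))  # expected to be close to sqrt(2/N)
flat_l2 = 1.0 / math.sqrt(N)                  # l2 norm of the flat vector with entries 1/N

print(l1, l2, math.sqrt(2.0 / N), flat_l2)
```

The ratio `l2 / flat_l2` comes out near $\sqrt{2}$, which is the sense in which the Laplace vector is "relatively flat".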




Fig. 4. A cartoon view of the $\ell^1$ and $\ell^2$ "rings" where vectors with iid Laplace-distributed entries concentrate. The radius of the $\ell^2$ ring is of the order of $\sqrt{2/N}$ while that of the $\ell^1$ ring is one, corresponding to vectors with flat entries $|x|_n \approx 1/N$.

Instead of model (24) we consider a more natural normalization of $\sigma_k(x)_q$ with respect to the size of the original vector $x$ measured in the same norm. This is the best $k$-term relative error $\bar\sigma_k(x)_q$ that we investigated in Proposition 1. The class of vectors defined by $\bar\sigma_k(x)_q \le C$ for some $C$ does not have the shape of an $\ell^r$ ball or weak $\ell^r$ ball. Instead it forms a set of compressible 'rays' as depicted in Figure 3(b).

A. Limits of G-CS guarantees using instance optimality 

In terms of the relative best $k$-term approximation error, the instance optimality implies the following inequality:

$$\frac{\|\Delta(\Phi x) - x\|}{\|x\|} \le \min_k \{C_k \cdot \bar\sigma_k(x)\}.$$

Note that if we have the following inequality satisfied for the particular realization of $x$

$$\bar\sigma_k(x) \ge C_k^{-1}, \quad \forall k,$$

then the only consequence of instance optimality is that $\|\Delta(\Phi x) - x\| \le \|x\|$. In other words, the performance guarantee for the considered vector $x$ is no better than for the trivial zero estimator: $\Delta_{trivial}(y) = 0$, for any $y$.

This simple observation illustrates that one should be careful in the interpretation of instance optimality. In particular, decoding algorithms with instance optimality guarantees may not universally perform better than other simple or more standard estimators.

To understand what this implies for specific distributions, consider the case of $\ell^1$ decoding with a Gaussian encoder $\Phi_N$. For this coder, decoder pair, $\{\Phi_N, \Delta_1\}$, we know there is a strong phase transition associated with the robust null space property (20) with $0 < \eta < 1$ (and hence the instance optimality property with $1 < C < \infty$) in terms of the undersampling factor $\delta := m/N$ and the factor $\rho := k/m$ as $k, m, N \to \infty$ [33]. This is a generalization of the $\ell^1$ exact recovery phase transition of Donoho and Tanner [18] which corresponds to $\eta = 1$. We can therefore identify the smallest instance optimality constant asymptotically possible as a function of $\rho$ and $\delta$, which we will term $C(\rho, \delta)$.

To check whether instance optimality guarantees can beat the trivial zero estimator $\Delta_{trivial}$ for a given undersampling ratio $\delta$, and a given generative model $p(x)$, we need to consider the product of $\bar\sigma_k(x)_1 \overset{a.s.}{\to} G_1[p](\kappa)$ and $C(\kappa/\delta, \delta)$. If

$$G_1[p](\kappa) \ge \frac{1}{C(\kappa/\delta, \delta)}, \quad \forall \kappa \in [0, \delta] \tag{25}$$

then the instance optimality offers no guarantee to outperform the trivial zero estimator.

In order to determine the actual strength of instance optimality we make the following observations:

• $C(\kappa/\delta, \delta) \ge 2$ for all $\kappa$ and $\delta$;

• $C(\kappa/\delta, \delta) = \infty$ for all $\delta$ if $\kappa > \kappa_0 \approx 0.18$.

The first observation comes from minimising $C_k$ in (21) with respect to $0 \le \eta \le 1$. The second observation stems from the fact that $\kappa_0 := \max_{\eta,\delta} \rho_\eta(\delta) \approx 0.18$ [16] (where $\rho_\eta$ is the strong threshold associated to the null space property with constant $\eta \le 1$), therefore we have $\kappa = \delta\rho \le \kappa_0 \approx 0.18$ for any finite $C$. From these observations we obtain:

For distributions with PDF $p(x)$ satisfying $G_1[p](\kappa_0) \ge 1/2$, in high dimension $N$, instance optimality results for the decoder $\Delta_1$ with a Gaussian encoder can at best guarantee the performance (in the $\ell^1$ norm) of ... the trivial decoder $\Delta_{trivial}$.

One might try to weaken the analysis by considering typical joint behavior of $\Phi_N$ and $x_N$. This corresponds to the 'weak' phase transitions [18], [33]. For this scenario there is a modified $\ell^1$ instance optimality property [33], however the constant still satisfies $C(\kappa/\delta, \delta) \ge 2$. Furthermore, since $\kappa \le \delta$ we can define an undersampling ratio $\delta_0$ by $G_1[p](\delta_0) = 1/2$, such that weak instance optimality provides no guarantee that $\Delta_1$ will outperform the trivial decoder $\Delta_{trivial}$ in the region $0 < \delta \le \delta_0$. More careful analysis will only increase the size of this region.

Example 3 (The Laplace distribution). Suppose that $x_N = (X_1, \ldots, X_N)$ has iid entries $X_n$ that follow the Laplace distribution with PDF $p_1(x)$. Then for large $N$, as noted in Example 1, the relative best $k$-term error is given by:

$$G_1[p_1](\kappa) = 1 - \kappa \cdot \left(1 + \ln \frac{1}{\kappa}\right).$$

Figure 5 shows that unfortunately this function exceeds 1/2 on the interval $\kappa \in [0, \kappa_0]$, indicating there are no non-trivial performance guarantees from instance optimality. Even exploiting weak instance optimality we can have no non-trivial guarantees below $\delta_0 \approx 0.18$.
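Both numerical claims of the example can be checked directly from the closed form. The following is a small sketch under the assumption that $G_1[p_1](\kappa) = 1 - \kappa(1 + \ln(1/\kappa))$ and that $\kappa_0 \approx 0.18$ as quoted above; the function names and the bisection routine are ours.

```python
import math

def G1_laplace(kappa):
    """Relative l^1 best k-term error for iid Laplace entries, kappa = k/N in (0, 1]."""
    return 1.0 - kappa * (1.0 + math.log(1.0 / kappa))

KAPPA_0 = 0.18  # approximate strong-threshold bound quoted in the text

def solve_half(lo=1e-6, hi=1.0, iters=60):
    """Bisection for G1(delta) = 1/2; G1 is decreasing on (0, 1]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if G1_laplace(mid) > 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

delta_0 = solve_half()
# G1 at kappa_0 should still exceed 1/2, and delta_0 should sit just above 0.18.
print(G1_laplace(KAPPA_0), delta_0)
```

Since $G_1[p_1]$ is decreasing in $\kappa$, exceeding 1/2 at $\kappa_0$ is enough to exceed it on the whole interval $(0, \kappa_0]$.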

Fig. 5. The $\ell^1$-norm best $k$-term approximation relative error $G_1[p_1](\kappa)$ as a function of $\kappa = k/N$ (top curve) along with a rectangular shaped function (bottom curve) that upper bounds $\inf_\delta C^{-1}(\kappa/\delta, \delta)$.



B. CS guarantees for random variables with unbounded second moment

A more positive result (Theorem 2) can be obtained showing that random variables with infinite second moment, which are highly compressible (cf. Proposition 1), are almost perfectly estimated by the $\ell^1$ decoder $\Delta_1$. In short, the result is based upon a variant of instance optimality: $\ell^2$ instance optimality in probability [13], which can be shown to hold for a large class of random matrices [15]. This can be combined with the fact that when $\mathbb{E}X^2 = \infty$, from Proposition 1 we have $G_2[p](\kappa) = 0$ for all $0 < \kappa < 1$ to give Theorem 2. The proof is in the Appendix.

Remark 1. A similar result can be derived based on $\ell^1$ instance optimality that shows that when $\mathbb{E}|X| = \infty$, then the relative error in $\ell^1$ for the $\ell^1$ decoder with a Gaussian encoder asymptotically goes to zero:

$$\lim_{N \to \infty} \frac{\|\Delta_1(\Phi_N x_N) - x_N\|_1}{\|x_N\|_1} \overset{a.s.}{=} 0.$$

Whether other results hold for general $\ell^p$ decoders and relative $\ell^p$ error is not known.

We can therefore conclude that a random variable with infinite variance is not only compressible (in the sense of Proposition 1): it can also be accurately approximated from undersampled measurements within a compressive sensing scenario. In contrast, instance optimality provides no guarantees of compressibility when the variance is finite and $G_1[p](\kappa_0) \ge 1/2$. At this juncture it is not clear where the blame for this result lies. Is it in the strength of the instance optimality theory, or are distributions with finite variance simply not able to generate sufficiently compressible vectors for sparse recovery to be successful at all? We will explore this latter question further in subsequent sections.

IV. G-CS PERFORMANCE OF ORACLE SPARSE 
RECONSTRUCTION VS LEAST SQUARES 

Consider $x$ an arbitrary vector in $\mathbb{R}^N$ and let $\Phi$ be an $m \times N$ Gaussian encoder, and let $y := \Phi x$. Besides the trivial zero estimator $\Delta_{trivial}$ and the $\ell^1$ minimization estimator $\Delta_1$ (11), the Least Squares (LS) estimator $\Delta_{LS}$ is a commonly used alternative. Due to the Gaussianity of $\Phi$ and its independence from $x$, it is well known that the resulting relative expected performance is

$$\frac{\mathbb{E}_\Phi \|\Delta_{LS}(\Phi x) - x\|_2^2}{\|x\|_2^2} = 1 - \frac{m}{N}. \tag{26}$$



< 



Moreover, there is indeed a concentration around the expected 
value, as expressed by the inequality below: 

|Als(^x)^x||^ / TON 

Hi V n)' 

(27) 

for any $\epsilon > 0$ and $x \in \mathbb{R}^N$, except with probability at most

The result is independent of the vector $x$, which should be no surprise since the Gaussian distribution is isotropic. The expected performance is directly governed by the undersampling factor, i.e. the ratio between the number of measurements $m$ and the dimension $N$ of the vector $x$, $\delta := m/N$.

In order to understand which statistical PDFs $p(x)$ lead to "compressible enough" vectors $x$, we wish to compare the performance of LS with that of estimators $\Delta$ that exploit the sparsity of $x$ to estimate it. Instead of choosing a particular estimator (such as $\Delta_1$), we consider the oracle sparse estimator $\Delta_{oracle}$ defined in (13), which is likely to upper bound the performance of most sparsity based estimators. While in practice $x$ must be estimated from $y = \Phi x$, the oracle is given precious side information: the index set $\Lambda$ associated to the $k$ largest components in $x$, where $k < m$. Given this information, the oracle computes

$$\Delta_{oracle}(y, \Lambda) := \arg\min_{\mathrm{support}(x) = \Lambda} \|y - \Phi x\|_2^2 = \Phi_\Lambda^+ y,$$

where, since $k < m$, the pseudo-inverse is $\Phi_\Lambda^+ = (\Phi_\Lambda^T \Phi_\Lambda)^{-1} \Phi_\Lambda^T$. Unlike LS, the expected performance of the oracle estimators drastically depends on the shape of the best $k$-term approximation relative error of $x$. Denoting $x_I$ the vector whose entries match those of $x$ on an index set $I$ and are zero elsewhere, and $\bar I$ the complement of an index set, we have the following result.

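Theorem 3 (Expected performance of oracle sparse estimation). Let $x \in \mathbb{R}^N$ be an arbitrary vector, $\Phi$ be an $m \times N$ random Gaussian matrix, and $y := \Phi x$. Let $\Lambda$ be an index set of size $k < m - 1$, either deterministic, or random but statistically independent from $\Phi$. We have

$$\frac{\mathbb{E}_\Phi \|\Delta_{oracle}(\Phi x, \Lambda) - x\|_2^2}{\|x\|_2^2} = \frac{1}{1 - \frac{k}{m-1}} \times \frac{\|x_{\bar\Lambda}\|_2^2}{\|x\|_2^2} \ge \frac{1}{1 - \frac{k}{m-1}} \times \bar\sigma_k(x)_2^2. \tag{28}$$

If $\Lambda$ is chosen to be the $k$ largest components of $x$, then the last inequality is an equality. Moreover, we can characterize the concentration around the expected value as

$$1 + \frac{k(1-\epsilon)^3}{m - k + 1} \le \frac{\|\Delta_{oracle}(\Phi x, \Lambda) - x\|_2^2}{\|x_{\bar\Lambda}\|_2^2} \le 1 + \frac{k(1-\epsilon)^{-3}}{m - k + 1} \tag{29}$$

except with probability at most

$$8 \cdot e^{-\min(k, m-k+1) \cdot c_l(\epsilon)/2} \tag{30}$$

where

$$c_l(\epsilon) := -\ln(1 - \epsilon) - \epsilon \ge \epsilon^2/2. \tag{31}$$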

Remark 2. Note that this result assumes that $\Lambda$ is statistically independent from $\Phi$. Interestingly, for practical decoders such as the $\ell^1$ decoder, $\Delta_1$, the selected $\Lambda$ might not satisfy this assumption, unless the decoder successfully identifies the support of the largest components of $x$.
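The expectation formula of Theorem 3 lends itself to a quick Monte Carlo sanity check. The following is a rough sketch in plain Python, not the authors' experiment: the test vector, the dimensions, the number of trials and the helper names `solve` and `oracle_error` are illustrative assumptions, and the oracle is implemented through the normal equations on the known support.

```python
import random

def solve(A, b):
    """Solve the small square system A z = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def oracle_error(x, support, m, rng):
    """Squared l2 error of the oracle estimate for one Gaussian draw of Phi."""
    N = len(x)
    Phi = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(m)]
    y = [sum(Phi[i][j] * x[j] for j in range(N)) for i in range(m)]
    # Normal equations restricted to the oracle support.
    A = [[sum(Phi[i][a] * Phi[i][b] for i in range(m)) for b in support] for a in support]
    rhs = [sum(Phi[i][a] * y[i] for i in range(m)) for a in support]
    est = [0.0] * N
    for a, v in zip(support, solve(A, rhs)):
        est[a] = v
    return sum((e - t) ** 2 for e, t in zip(est, x))

rng = random.Random(0)
N, m, k, trials = 20, 12, 3, 2000
x = [1.0 / (n + 1) for n in range(N)]   # hypothetical test vector, sorted by magnitude
support = list(range(k))                # indices of the k largest entries
tail_energy = sum(v * v for v in x[k:])

mean_err = sum(oracle_error(x, support, m, rng) for _ in range(trials)) / trials
predicted = tail_energy / (1.0 - k / (m - 1))   # unnormalized form of (28)
print(mean_err, predicted)
```

The empirical mean and the prediction should agree up to Monte Carlo noise; both are left unnormalized here, since dividing by $\|x\|_2^2$ affects each side equally.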

A. Compromise between approximation and conditioning 

We observe that the expected performance of both $\Delta_{LS}$ and $\Delta_{oracle}$ is essentially governed by the quantities $\delta = m/N$ and $\rho = k/m$, which are reminiscent of the parameters in the phase transition diagrams of Donoho and Tanner [18]. However, while in the work of Donoho and Tanner the quantity $\rho$ parameterizes a model on the vector $x_N$, which is assumed to be $\rho\delta N$-sparse, here $\rho$ rather indicates the order of $k$-term approximation of $x_N$ that is chosen in the oracle estimator. In a sense, it is more related to a stopping criterion that one would use in a greedy algorithm. The quantity that actually models $x_N$ is the function $G_2[p]$, provided that $x_N \in \mathbb{R}^N$ has iid entries $X_n$ with PDF $p(x)$ and finite second moment $\mathbb{E}X^2 < \infty$. Indeed, combining Proposition 1 and Theorem 3 we obtain:

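Theorem 4. Let $x_N$ be iid with respect to $p(x)$ as in Proposition 1. Assume that $\mathbb{E}X^2 < \infty$. Let $\phi_{ij}$, $i, j \in \mathbb{N}$ be iid Gaussian variables $\mathcal{N}(0, 1)$. Consider two sequences $k_N, m_N$ of integers and assume that

$$\lim_{N \to \infty} k_N/m_N = \rho \quad \text{and} \quad \lim_{N \to \infty} m_N/N = \delta. \tag{32}$$

Define the $m_N \times N$ Gaussian encoder $\Phi_N = [\phi_{ij}]_{1 \le i \le m_N, 1 \le j \le N}$. Let $\Lambda_N$ be the index set of the $k_N$ largest magnitude coordinates of $x_N$. We have the almost sure convergence

$$\lim_{N \to \infty} \frac{\|\Delta_{oracle}(\Phi_N x_N, \Lambda_N) - x_N\|_2^2}{\|x_N\|_2^2} \overset{a.s.}{=} \frac{G_2[p](\rho\delta)}{1 - \rho}, \tag{33}$$

$$\lim_{N \to \infty} \frac{\|\Delta_{LS}(\Phi_N x_N) - x_N\|_2^2}{\|x_N\|_2^2} \overset{a.s.}{=} 1 - \delta. \tag{34}$$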

For a given undersampling ratio $\delta = m/N$, the asymptotic expected performance of the oracle therefore depends on the relative number of components that are kept $\rho = k/m$, and
we observe the same tradeoff as discussed in Section |III1 

• For large $k$, close to the number of measurements $m$ ($\rho$ close to one), the ill-conditioning of the pseudo-inverse matrix $\Phi_\Lambda$ (associated to the factor $1/(1 - \rho)$) adversely impacts the expected performance;

• For smaller $k$, the pseudo-inversion of this matrix is better conditioned, but the $k$-term approximation error governed by $G_2[p](\rho\delta)$ is increased.

Overall, for some intermediate size $k \approx \rho^\star m$ of the oracle support set $\Lambda_k$, the best tradeoff between good approximation and good conditioning is achieved, leading at best to the asymptotic expected performance

$$H[p](\delta) := \inf_{\rho \in (0,1)} \frac{G_2[p](\rho\delta)}{1 - \rho}. \tag{35}$$
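The tradeoff behind (35) can be made concrete for one distribution. The sketch below assumes the unit-scale Laplace density, for which the integral form of $G_2$ from Proposition 1 gives $G_2[p_1](\kappa) = 1 - \kappa(1 + \ln(1/\kappa) + \tfrac12 \ln^2(1/\kappa))$ (our own evaluation of that integral, stated here as an assumption); the grid search, the value $\delta = 0.3$ and the function names are illustrative choices.

```python
import math

def G2_laplace(kappa):
    """Relative l^2 best k-term error for iid Laplace entries (assumed closed form)."""
    L = math.log(1.0 / kappa)
    return 1.0 - kappa * (1.0 + L + 0.5 * L * L)

def oracle_curve(delta, rho):
    """Asymptotic oracle error G2(rho * delta) / (1 - rho), as in (33)."""
    return G2_laplace(rho * delta) / (1.0 - rho)

def H(delta, grid=2000):
    """Grid approximation of (35): returns (best rho, best value)."""
    rhos = [(i + 0.5) / grid for i in range(grid)]
    return min(((rho, oracle_curve(delta, rho)) for rho in rhos), key=lambda t: t[1])

delta = 0.3
rho_star, h = H(delta)
# Small rho: poor approximation; large rho: poor conditioning; the minimum sits in between.
print(oracle_curve(delta, 0.05), h, oracle_curve(delta, 0.95))
print(rho_star, 1.0 - delta)   # best rho, and the least squares error for comparison
```

At this moderate undersampling the oracle with its best $\rho$ does beat the least squares value $1 - \delta$; the interesting regime is what happens as $\delta$ shrinks.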



V. A COMPARISON OF LEAST SQUARES AND ORACLE 
SPARSE METHODS 

The question that we will now investigate is how the expected performance of oracle sparse methods compares to that of least squares, i.e., how large is $H[p](\delta)$ compared to $1 - \delta$? We are particularly interested in understanding how they compare for small $\delta$. Indeed, large $\delta$ values are associated with scenarios that are quite irrelevant to, for example, compressive sensing since the projection $\Phi x$ cannot significantly compress the dimension of $x$. Moreover, it is in the regime where $\delta$ is small that the expected performance of least squares is very poor, and we would like to understand for which PDFs $p$ sparse approximation is an inappropriate tool. The answer will of course depend on the PDF $p$ through the function $G[p](\cdot)$. To characterize this we will say that a PDF $p$ is incompressible at a subsampling rate of $\delta$ if

$$H[p](\delta) \ge 1 - \delta.$$

In practice, there is often a minimal undersampling rate, $\delta_0$, such that for $\delta \in (0, \delta_0)$ least squares estimation dominates the oracle sparse estimator. Specifically we will show below that PDFs $p(x)$ with a finite fourth moment $\mathbb{E}X^4 < \infty$, such as generalized Gaussians, always have some minimal undersampling rate $\delta_0 \in (0, 1)$ below which they are incompressible. As a result, unless we perform at least $m \ge \delta_0 N$ random Gaussian measurements of an associated $x_N$, it is not worth relying on sparse methods for reconstruction since least squares can do as good a job.

When the fourth moment of the distribution is infinite, one might hope that the converse is true, i.e. that no such minimal undersampling rate $\delta_0$ exists. However, this is not the case. We will show that there is a PDF $p_0$, with infinite fourth moment and finite second moment, such that

$$H[p_0](\delta) = 1 - \delta, \quad \forall \delta \in (0, 1).$$

Up to a scaling factor, this PDF is associated to the symmetric PDF

$$p_0(x) := \frac{2|x|}{(x^2 + 1)^3} \tag{36}$$

and illustrates that least squares can be competitive with oracle sparse reconstruction even when the fourth moment is infinite.

A. Distributions incompatible with extreme undersampling

In this section we show that when a PDF $p(x)$ has a finite fourth moment, $\mathbb{E}X^4 < \infty$, then it will generate vectors which are not sufficiently compressible to be compatible with compressive sensing at high levels of undersampling. We begin by showing that the comparison of $H[p](\delta)$ to $1 - \delta$ is related to that of $G_2[p](\kappa)$ with $(1 - \sqrt{\kappa})^2$.



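Lemma 1. Consider a function $G(\kappa)$ defined on $(0, 1)$ and define

$$H(\delta) := \inf_{\rho \in (0,1)} \frac{G(\delta\rho)}{1 - \rho}. \tag{37}$$

1) If $G(\delta^2) \le (1 - \delta)^2$, then $H(\delta) \le 1 - \delta$.

2) If $G(\kappa) \le (1 - \sqrt{\kappa})^2$ for all $\kappa \in (0, \delta_0^2)$, then $H(\delta) \le 1 - \delta$ for all $\delta \in (0, \delta_0)$.

3) If $G(\kappa) \ge (1 - \sqrt{\kappa})^2$ for all $\kappa \in (0, \delta_0)$, then $H(\delta) \ge 1 - \delta$ for all $\delta \in (0, \delta_0)$.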
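The mechanism behind Lemma 1 is short enough to write out. The following is only a sketch of one way to see items 1 and 3, reading the inequalities as non-strict (an assumption on our part); it uses nothing beyond the definition (37) and the arithmetic-geometric mean inequality.

```latex
% Sketch of item 3: fix 0 < \delta < \delta_0 and any \rho \in (0,1).
% Since \delta\rho < \delta_0, the hypothesis applies at \kappa = \delta\rho.
\begin{align*}
\frac{G(\delta\rho)}{1-\rho}
  &\ge \frac{(1-\sqrt{\delta\rho})^2}{1-\rho}, \\
(1-\sqrt{\delta\rho})^2 - (1-\delta)(1-\rho)
  &= \delta + \rho - 2\sqrt{\delta\rho}
   = (\sqrt{\delta}-\sqrt{\rho})^2 \;\ge\; 0 .
\end{align*}
% Hence G(\delta\rho)/(1-\rho) \ge 1-\delta for every \rho, and taking the
% infimum over \rho gives H(\delta) \ge 1-\delta.
%
% Sketch of item 1: choose \rho = \delta in the infimum,
%   H(\delta) \le G(\delta^2)/(1-\delta) \le (1-\delta)^2/(1-\delta) = 1-\delta .
```

Equality in the second line occurs at $\rho = \delta$, which is why the curve $(1 - \sqrt{\kappa})^2$ evaluated at $\kappa = \delta^2$ is the natural benchmark for $G$.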

Lemma 1 allows us to deal directly with $G_2[p](\kappa)$ instead of $H[p](\delta)$. Furthermore the $(1 - \sqrt{\kappa})^2$ term can be related to the fourth moment of the distribution (see Lemma 3 in the Appendix), giving the following result, which implies Theorem 1.



Theorem 5. If $\mathbb{E}_{p(x)} X^4 < \infty$, then there exists a minimum undersampling $\delta_0 = \delta_0[p] > 0$ such that for $\delta < \delta_0$,

$$H[p](\delta) \ge 1 - \delta, \quad \forall \delta \in (0, \delta_0), \tag{38}$$

and the performance of the oracle $k$-sparse estimation as described in Theorem 4 is asymptotically almost surely worse than that of least squares estimation as $N \to \infty$.

Roughly speaking, if $p(x)$ has a finite fourth moment, then in the regime where the relative number of measurements is (too) small we obtain a better reconstruction with least squares than with the oracle sparse reconstruction!

Note that this is rather strong, since the oracle is allowed 
to know not only the support of the k largest components of 
the unknown vector, but also the best choice of k to balance 
approximation error against numerical conditioning. A striking 
example is the case of generalized Gaussian distributions 
discussed below. 

One might also hope that, reciprocally, having an infi- 
nite fourth moment would suffice for a distribution to be 
compatible with compressed sensing at extreme levels of 
undersampling. The following result disproves this hope. 

Proposition 2. With the PDF $p_0(x)$ defined in (36), we have

$$H[p_0](\delta) = 1 - \delta, \quad \forall \delta \in (0, 1). \tag{39}$$

On reflection this should not be that surprising. The PDF $p_0(x)$ has no probability mass at $x = 0$ and resembles a smoothed Bernoulli distribution with heavy tails.
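A direct computation makes the proposition plausible. The sketch below assumes the integral expression for $G_2[p]$ that follows from Proposition 1, namely the second moment of $|X|$ truncated at its $(1 - \kappa)$ quantile divided by the full second moment; the substitution $u = x^2$ and the symbol $t_\kappa$ for the quantile are ours.

```latex
% Density of |X| under p_0 on x >= 0: 4x/(x^2+1)^3, so P(|X| > t) = (t^2+1)^{-2}.
% Setting this tail probability equal to \kappa gives t_\kappa^2 + 1 = \kappa^{-1/2}.
\begin{align*}
\int_0^{t} x^2 \,\frac{4x}{(x^2+1)^3}\,dx
  &= \int_0^{t^2} \frac{2u}{(u+1)^3}\,du
   = 2\Big[-\frac{1}{u+1} + \frac{1}{2(u+1)^2}\Big]_0^{t^2}
   = 1 - \frac{2}{t^2+1} + \frac{1}{(t^2+1)^2}, \\
\mathbb{E}X^2 &= \lim_{t\to\infty}\Big(1 - \frac{2}{t^2+1} + \frac{1}{(t^2+1)^2}\Big) = 1, \\
G_2[p_0](\kappa) &= 1 - 2\sqrt{\kappa} + \kappa = (1-\sqrt{\kappa})^2 .
\end{align*}
```

Under these assumptions $G_2[p_0]$ coincides exactly with the benchmark curve of Lemma 1, so items 2 and 3 of the lemma hold simultaneously for every $\delta_0 \in (0, 1)$, which is consistent with (39).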



B. Worked example: the generalized Gaussian distributions

Theorem 5 applies in particular whenever $x_N$ is drawn from a generalized Gaussian distribution

$$p_\tau(x) \propto \exp(-c|x|^\tau) \tag{40}$$

where $0 < \tau < \infty$. The shape parameter, $\tau$, controls how heavy or light the tails of the distribution are. When $\tau = 2$ the distribution reduces to the standard Gaussian, while for $\tau < 2$ it gives a family of heavy tailed distributions with positive kurtosis. When $\tau = 1$ we have the Laplace distribution and for $\tau \le 1$ it is often considered that the distribution is in some way "sparsity-promoting". However, the generalized Gaussian always has a finite fourth moment for all $\tau > 0$. Thus Theorem 5 informs us that for a given parameter $\tau$ there is always a critical undersampling value below which the generalized Gaussian is incompressible.

While Theorem 5 indicates the existence of a critical $\delta_0$ it does not provide us with a useful bound. Fortunately, although in general we are unable to derive explicit expressions for $G[p](\cdot)$ and $H[p](\delta)$ (with the exceptions of $\tau = 1, 2$, see the Appendix), the generalized Gaussian has a closed form expression for its cdf in terms of the incomplete gamma function:

$$F_\tau(x) = \frac{1}{2} + \mathrm{sgn}(x)\,\frac{\gamma(1/\tau, c|x|^\tau)}{2\Gamma(1/\tau)}$$

where $\Gamma(\cdot)$ and $\gamma(\cdot, \cdot)$ are respectively the gamma function and the lower incomplete gamma function. We are therefore able to numerically compute the value of $\delta_0$ as a function of $\tau$ with relative ease. This is shown in Figure 6. We see that, unsurprisingly, when $\tau$ is around 2 there is little to be gained even with an oracle sparse estimator over standard least squares estimation. When $\tau = 1$ (Laplace distribution) the value of $\delta_0 \approx 0.15$, indicating that when subsampling by a factor of roughly 7 the least squares estimator will be superior. At this level of undersampling the relative error is very poor: 0.85, that is a performance of 0.7dB in terms of traditional Signal to Distortion Ratio (SDR).

The critical undersampling value steadily drops as $\tau$ tends towards zero and the distribution becomes increasingly leptokurtic. Thus data distributed according to the generalized Gaussian for small $\tau \ll 1$ may still be a reasonable candidate for compressive sensing distributions as long as the undersampling rate is kept significantly above the associated $\delta_0$.

Fig. 6. A plot of the critical subsampling rate, $\delta_0$, below which the generalized Gaussian distribution is incompressible, as a function of the shape parameter $\tau$.
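For the Laplace case $\tau = 1$ the critical rate can be recovered without any special functions. The sketch below is a rough numerical illustration rather than the computation behind Figure 6: it assumes the closed form $G_2[p_1](\kappa) = 1 - \kappa(1 + \ln(1/\kappa) + \tfrac12\ln^2(1/\kappa))$ (our evaluation of the integral formula from Proposition 1), approximates the infimum in (35) on a grid, and brackets the crossing with an interval chosen by hand.

```python
import math

def G2_laplace(kappa):
    """Relative l^2 best k-term error for iid Laplace entries (assumed closed form)."""
    L = math.log(1.0 / kappa)
    return 1.0 - kappa * (1.0 + L + 0.5 * L * L)

def H(delta, grid=4000):
    """Grid approximation of inf over rho in (0, 1) of G2(delta * rho) / (1 - rho)."""
    return min(G2_laplace(delta * (i + 0.5) / grid) / (1.0 - (i + 0.5) / grid)
               for i in range(grid))

def critical_delta(lo=0.05, hi=0.5, iters=40):
    """Bisection on H(delta) - (1 - delta), assumed positive at lo and negative at hi."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if H(mid) > 1.0 - mid:
            lo = mid      # oracle still worse than least squares: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)

delta_0 = critical_delta()
print(delta_0)   # should land close to the value 0.15 quoted for the Laplace case
```

For other values of $\tau$ the same recipe applies once $G_2[p_\tau]$ is evaluated through incomplete gamma functions, which requires a numerical library and is not attempted here.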
C. Expected Relative Error for the Laplace distribution

We conclude this section by examining in more detail the performance of the estimators for Laplace distributed data at various undersampling values. We have already seen from Figure 6 that the oracle performance is poor when subsampling by roughly a factor of 7. What about more modest subsampling factors? Figure 1 plots the relative error as a function of undersampling rate, $\delta$. The horizontal lines indicate SDR values of 3dB, 10dB and 20dB. Thus for the oracle estimator to achieve 10dB the undersampling rate must be greater than 0.7, while to achieve a performance level of 20dB, something that might reasonably be expected in many sensing applications, we can hardly afford any subsampling at all since this requires $\delta > 0.9$.

At this point we should remind the reader that these performance results are for the comparison between the oracle sparse estimator and linear least squares. For practically implementable reconstruction algorithms we would expect that the critical undersampling rate at which least squares wins would be significantly higher. Indeed, as shown in Figure 1, this is what is empirically observed for the average performance of the $\ell^1$ estimator (11) applied to Laplace distributed data. This curve was calculated at various values of $\delta$ by averaging the relative error of 5000 $\ell^1$ reconstructions of independent Laplace distributed realizations of $x_N$ with $N = 256$. In particular note that the $\ell^1$ estimator only outperforms least squares for undersampling $\delta$ above approximately 0.65!
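The SDR figures used in this section follow from a one-line conversion. The sketch below assumes the usual definition $\mathrm{SDR} = -10 \log_{10}(\text{relative squared error})$; the function names are ours, and the last column uses the least squares relative error $1 - \delta$ from (34).

```python
import math

def sdr_db(relative_error):
    """Signal to Distortion Ratio in dB for a given relative squared error."""
    return -10.0 * math.log10(relative_error)

def relative_error(sdr):
    """Relative squared error corresponding to an SDR given in dB."""
    return 10.0 ** (-sdr / 10.0)

# Relative error 0.85 at the critical Laplace undersampling rate.
print(sdr_db(0.85))

for target in (3, 10, 20):
    err = relative_error(target)
    # Last column: undersampling rate at which least squares reaches this SDR.
    print(target, err, 1.0 - err)
```

Read this way, a 20dB target corresponds to a relative error of 0.01, which least squares alone only reaches at $\delta = 0.99$.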

VI. Concluding discussion 

As we have just seen, Generalized Gaussian distributions are incompressible at low subsampling rates because their fourth moment is always finite. This confirms the results of Cevher obtained with a different approach [9], but may come as a surprise: for $0 < \tau \le 1$ the minimum $\ell^\tau$ norm solution to $y = \Phi x$, which is also the MAP estimator under the Generalized Gaussian prior, is known to be a good estimator of $x_0$ when $y = \Phi x_0$ and $x_0$ is compressible [14]. This highlights the need to distinguish between an estimator and its MAP interpretation. In contrast, we describe below a family of PDFs $p_{\tau,s}$ which, for certain values of the parameters $\tau, s$, combines:

• superior asymptotic almost sure performance of oracle sparse estimation $\Delta_{oracle}$ over least squares reconstruction $\Delta_{LS}$, even in the largely undersampled scenarios $\delta \to 0$;

• connections between oracle sparse estimation and MAP estimation.

Example 4. For $0 < \tau < \infty$, $1 < s < \infty$ consider the probability density function

$$p_{\tau,s}(x) \propto (1 + |x|^\tau)^{-s/\tau}. \tag{41}$$

1) When $1 < s \le 3$, the distribution is compressible.
Since $\mathbb{E}_{p_{\tau,s}} X^2 = \infty$, Theorem 2 is applicable: the $\ell^1$ decoder with a Gaussian encoder has ideal asymptotic performance, even at arbitrarily small undersampling $\delta = m/N$;

2) When $3 < s < 5$, the distribution remains somewhat compressible.
On the one hand $\mathbb{E}_{p_{\tau,s}} X^2 < \infty$, on the other hand $\mathbb{E}_{p_{\tau,s}} X^4 = \infty$.
A detailed examination of the $G_2[p_{\tau,s}]$ function shows that there exists a relative number of measurements $\delta_0(\tau, s) > 0$ such that in the low measurement regime $\delta < \delta_0$, the asymptotic almost sure performance of oracle $k$-sparse estimation, as described in Theorem 4 with the best choice of $k$, is better than that of least squares estimation:

$$H[p_{\tau,s}](\delta) < 1 - \delta, \quad \forall \delta \in (0, \delta_0). \tag{42}$$

3) When $s > 5$, the distribution is incompressible.
Since $\mathbb{E}_{p_{\tau,s}} X^4 < \infty$, Theorem 5 is applicable: with a Gaussian encoder, there is an undersampling ratio $\delta_0$ such that whenever $\delta < \delta_0$, the asymptotic almost sure performance of oracle sparse estimation is worse than that of least-squares estimation.

Comparing Proposition 2 with the above Example 4, one observes that both the PDF $p_0(x)$ (Equation (36)) and the PDFs $p_{\tau,s}$, $3 < s < 5$, satisfy $\mathbb{E}X^2 < \infty$ and $\mathbb{E}X^4 = \infty$. Yet, while $p_0$ is essentially incompressible, the PDFs $p_{\tau,s}$ in this range are compressible. This indicates that, for distributions with finite second moment and infinite fourth moment, compressibility depends not only on the tail of the distribution but also on their mass around zero. However the precise dependency is currently unclear.

For $\tau = 2$, the PDF $p_{2,s}$ is a Student-t distribution. For $\tau = 1$, it is called a generalized Pareto distribution. These have been considered in [9], [2] as examples of "compressible" distributions, with the added condition that $s \le 2$. Such a restriction results from the use of $\ell^2 - \ell^1$ instance optimality in [9], [2], which implies that sufficient compressibility conditions can only be satisfied when $\mathbb{E}|X| = \infty$. Here instead we exploit $\ell^2 - \ell^2$ instance optimality in probability, making it possible to obtain compressibility when $\mathbb{E}X^2 = \infty$. In other words, [9], [2] provide sufficient conditions on a PDF $p$ to check its compressibility, but are inconclusive in characterizing its incompressibility.

The family of PDFs $p_{\tau,s}$ in the range $0 < \tau \le 1$ can also be linked with a sparsity-inducing MAP estimate. Specifically, for an observation $y = \Phi x$ of a given vector $x \in \mathbb{R}^N$, one can define the MAP estimate under the probabilistic model where all entries of $x$ are considered as iid distributed according to $p_{\tau,s}$:

$$\Delta_{MAP}(y) := \arg\max_{x | \Phi x = y} \prod_{n=1}^N p_{\tau,s}(x_n) = \arg\min_{x | \Phi x = y} \sum_{n=1}^N f_\tau(|x_n|),$$

where for $t \in \mathbb{R}_+$ we define $f_\tau(t) := \log(1 + t^\tau) = a_{\tau,s} - \frac{\tau}{s} \log p_{\tau,s}(|t|)$. One can check that the function $f_\tau$ is associated to an admissible $f$-norm as described in [22], [23]: $f(0) = 0$, $f(t)$ is non-decreasing, $f(t)/t$ is non-increasing (in addition, we have $f(t) \sim_{t \to 0} t^\tau$). Observing that the MAP estimate is a "minimum $f$-norm" solution to the linear problem $y = \Phi x$, we can conclude that whenever $x$ is a "sufficiently (exact) sparse" vector, we have in fact [22], [23] $\Delta_{MAP}(\Phi x) = x$, and $\Delta_{MAP}(\Phi x) = \Delta_1(\Phi x)$ is also the minimum $\ell^1$ norm solution to $y = \Phi x$, which can in turn be "interpreted" as the MAP estimate under the iid Laplace model. However, unlike the Laplace interpretation of $\ell^1$ minimization, here Example 4 indicates that such densities are better aligned to sparse reconstruction techniques. Thus the MAP estimate interpretation here may be more valid.
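The three admissibility conditions on $f_\tau(t) = \log(1 + t^\tau)$ can be probed numerically. The sketch below is a grid check, not a proof: the grid range, the tolerance and the function names are arbitrary choices of ours, and $\tau = 2$ is included only as a contrasting case outside the range $0 < \tau \le 1$ considered in the text.

```python
import math

def f(t, tau):
    """Penalty f_tau(t) = log(1 + t^tau) for t >= 0."""
    return math.log(1.0 + t ** tau)

def looks_admissible(tau, t_max=50.0, steps=5000):
    """Grid check of f(0) = 0, f non-decreasing and f(t)/t non-increasing."""
    ts = [t_max * i / steps for i in range(1, steps + 1)]
    vals = [f(t, tau) for t in ts]
    ratios = [v / t for v, t in zip(vals, ts)]
    eps = 1e-12  # tolerance for floating point noise
    nondecreasing = all(b >= a - eps for a, b in zip(vals, vals[1:]))
    ratio_nonincreasing = all(b <= a + eps for a, b in zip(ratios, ratios[1:]))
    return f(0.0, tau) == 0.0 and nondecreasing and ratio_nonincreasing

for tau in (0.5, 1.0, 2.0):
    print(tau, looks_admissible(tau))
```

For $\tau = 2$ the ratio $f(t)/t$ initially increases near zero, so the check fails there, in line with the restriction to $0 < \tau \le 1$.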

It would be interesting to determine whether the MAP estimator $\Delta_{MAP}(\Phi x)$ for such distributions is in some way close to optimal (i.e. close to the minimum mean squared error solution for $x$). This would give such estimators a degree of legitimacy from a Bayesian perspective. However, we have not shown that the estimator $\Delta_{MAP}(\Phi x)$ provides a good estimate for data that is distributed according to $p_{\tau,s}$ since, if $x$ is a large dimensional typical instance with entries drawn iid from the PDF $p_{\tau,s}(x)$, it is typically not exactly sparse, hence the uniqueness results of [22], [23] do not directly apply. One would need to resort to a more detailed robustness analysis in the spirit of [21] to get more precise statements relating $\Delta_{MAP}(\Phi x)$ to $x$.



The proof of the two bounds is identical, hence we only detail the first one. Fix $0 < \epsilon < t_0$ and define $\tau = \tau(\epsilon) := t_0 - \epsilon$, $\sigma = \sigma(\epsilon) := \int_0^\tau y\,dF_Y(y)$, and $s_N = N\sigma$. Defining $L_N$ as in (43), we can apply Theorem 6 and obtain $\lim_{N \to \infty} L_N/N \overset{a.s.}{=} F_Y(\tau)$. Since $\lim_{N \to \infty} k_N/N = \kappa = 1 - F_Y(t_0)$, it follows that

$$\lim_{N \to \infty} \frac{N - k_N}{L_N} \overset{a.s.}{=} \frac{F_Y(t_0)}{F_Y(\tau)} > 1$$

where we used the fact that $F_Y$ is strictly increasing and $\tau < t_0$. In other words, almost surely, we have $N - k_N > L_N$ for all large enough $N$. Now remember that by definition

$$L_N = \max\{\ell \le N,\ \sigma_{N-\ell}(x_N)_q^q \le N\sigma\}.$$

As a result, almost surely, for all large enough $N$, we have

$$\sigma_{k_N}(x_N)_q^q > N\sigma.$$

Now, by the strong law of large numbers, we also have



Appendix 

A. Proof of Proposition [7] 

To prove Proposition [T] we will rely on the following 
theorem jS] [Theorem 2.2]. 

Theorem 6. Suppose that Fy is a continuous and strictly 
increasing cumulative density function on [a, b] where < 
a < 6 < 00, with FY{a) = 0, Fy(6) = 1. For a G (0,^) 
where fi = ydFY{y), let r e (a, 6) be defined by the 
equation o' — ydFY{y). Let Si, S2, ■ ■ ■ be a sequence such 
that lim^v-foo Sn/N — a, and let Yi,Y2 . . . be iid random 
variables with cumulative density function Fy- Let Yi.at < 
• ■ • < Yn,n be the increasing order statistics 0/ Yi, . . . , Yjq 
and let Ln — L{N,Sn) be defined as L{N,s„) :— ;/ 
Yi^N > sn, otherwise: 



lim 



hence we obtain 



.. . crfe„(xAr)« o^s. cr ydFyiy) 
jv^oo IIxatII^ ^ fj. 

Since this holds for any e > and Fy is continuous, this 
implies ( |47] |. The other bound ( |48] | is obtained similarly. Since 
the two match, we get 



lim 



Since k 



<^kA^N)l a^. CydFYjy) ^ CydFYjy) 
M ydFY{y)' 



Fy{to) 



F{tq^'') we have tq 



[F-i(l - k)]''. Since = F{y^^i) we have dFy(y) 



L{N, .„) max{£ <N,Y,^n + ... + Y,,n < sn} ; (43) -y-W^')dy- As a resuh 



Then 



lim 



Y, 



Ln,N o^. 



N 



lim ^^d-FY{T), 



lim 



Fy{t). 



(44) 
(45) 
(46) 



r.'^ydFYiy) 
irydFviy) 



(a) 



Jo 



y'^My'/')dy 

xp{x)x''~^dx 



/q xp{x)xi ^dx 



Proof of Proposition [7} We begin by the case where 
E|Jf|' < 00. We consider random variables X„ drawn accord- 
ing the PDF p{x), and we define the iid non-negative random 
variables y„ = l-'fnl'^- They have the cumulative density 
function FYiy) = P{Y < y) ^ P{\X\ < y^/^) = F{y^/'>), 
and we have ii = WY = ¥.\X\i = \x\'idF(x) e (0,oo). 
We define x^r — {Xn)n=i^ we consider a sequence fcjv 
such that limAT^oo kN /N = k g (0, 1). By the assumptions 
on Fy there is a unique tq G (0, 00) such that n = 1 — Fy{to), 
and we will prove that 



Jo 



x'^p{x)dx 



x'ip{x)dx 

where in (a) we used the change of variable y = x*, x — y^^^, 
dy — qx'^^^dx. We have proved the result for < k < 1, and 
we let the reader check that minor modifications yield the 
results for k = and k = 1. 

Now we consider the case E|X|'' = +00. The idea is to use 
a "saturated" version X of the random variable X, such that 
E|X|' < 00, so as to use the results proven just above. 

One can easily build a family of smooth saturation functions 



. afe„(xjv)« J,^°ydFY{y) 

limmi — > — . 

^kA^N^, lo^ydFYiy) 

lim sup — < — . 

AT-s-oo N/J, fi 



$f_\eta : [0, \infty) \to [0, 2\eta)$, $0 < \eta < \infty$ with $f_\eta(t) = t$ for
$t \in [0, \eta]$, $f_\eta(t) \le t$ for $t > \eta$, and two additional properties:

1) each function $t \mapsto f_\eta(t)$ is bijective from $[0, \infty)$ onto
$[0, 2\eta)$, with $f'_\eta(t) > 0$ for all $t$;

2) each function $t \mapsto f_\eta(t)/t$ is monotonically decreasing;

Denoting $f_\eta(x) := (f_\eta(x_i))_{i=1}^N$, by [23, Theorem 5], the first
two properties ensure that for all $1 \le k \le N$, $x \in \mathbb{R}^N$,
$0 < \eta, q < \infty$ we have

$$\frac{\sigma_k(x)_q^q}{\|x\|_q^q} \le \frac{\sigma_k(f_\eta(x))_q^q}{\|f_\eta(x)\|_q^q}. \tag{49}$$
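As an illustration of inequality (49), the following minimal Python sketch checks it numerically for one concrete saturation function. The choice $f_\eta(t) = 2\eta - \eta^2/t$ for $t > \eta$ is a hypothetical example introduced here (it is only continuously differentiable, not smooth, but it is increasing, bounded by $2\eta$, and has decreasing $f_\eta(t)/t$); the function names and the Pareto test data are likewise assumptions made for the illustration.

```python
import random


def saturate(t, eta):
    """Hypothetical saturation function: identity on [0, eta], then 2*eta - eta^2/t."""
    return t if t <= eta else 2.0 * eta - eta * eta / t


def relative_tail(x, k, q):
    """Return sigma_k(x)_q^q / ||x||_q^q, the relative best k-term error of x."""
    powers = sorted((abs(v) ** q for v in x), reverse=True)
    return sum(powers[k:]) / sum(powers)


def check_inequality_49(n=200, k=20, q=1.5, eta=2.0, trials=100, seed=0):
    """Check that saturation never decreases the relative best k-term error."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Heavy-tailed magnitudes, so that saturation changes the large entries.
        x = [rng.paretovariate(1.2) for _ in range(n)]
        fx = [saturate(t, eta) for t in x]
        if relative_tail(x, k, q) > relative_tail(fx, k, q) + 1e-12:
            return False
    return True
```

Calling `check_inequality_49()` compares both sides of (49) on random heavy-tailed vectors; it is a sanity check of the statement under the assumed $f_\eta$, not a substitute for [23, Theorem 5].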



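Consider a fixed $\eta$ and the sequence of "saturated" random
variables $\tilde X_i = f_\eta(|X_i|)$. They are iid with $\mathbb{E}|\tilde X|^q < \infty$.
Moreover, the first property of $f_\eta$ above ensures that their
cdf $t \mapsto F_\eta(t) := \mathbb{P}(f_\eta(|X|) \le t)$ is continuous and strictly
increasing on $[0, 2\eta]$, with $F_\eta(0) = 0$ and $F_\eta(\infty) = 1$. Hence,
by the first part of Proposition 1 just proven above, we have

$$\lim_{N\to\infty} \frac{\sigma_{k_N}(f_\eta(x_N))_q^q}{\|f_\eta(x_N)\|_q^q} \overset{a.s.}{=} G_q[p_\eta](\kappa) = \frac{\int_0^{F_\eta^{-1}(1-\kappa)} x^q\, p_\eta(x)\,dx}{\mathbb{E}|f_\eta(X)|^q} \le \frac{|F_\eta^{-1}(1-\kappa)|^q}{\mathbb{E}|f_\eta(X)|^q}. \tag{50}$$

Since $f_\eta(t) \le t$ for all $t$, we have $F_\eta(t) = \mathbb{P}(f_\eta(|X|) \le t) \ge \mathbb{P}(|X| \le t) = \bar F(t)$
for all $t$, hence $F_\eta^{-1}(1 - \kappa) \le \bar F^{-1}(1 - \kappa)$.
Moreover, since $f_\eta(t) = t$ for $0 \le t \le \eta$, we
obtain $\mathbb{E}|f_\eta(X)|^q \ge \int_0^{\eta} x^q\, \bar p(x)\,dx$. Combining (49) and (50)
with the above observations we obtain for any $0 < \eta < \infty$

$$\limsup_{N\to\infty} \frac{\sigma_{k_N}(x_N)_q^q}{\|x_N\|_q^q} \le \lim_{N\to\infty} \frac{\sigma_{k_N}(f_\eta(x_N))_q^q}{\|f_\eta(x_N)\|_q^q} \overset{a.s.}{\le} \frac{|\bar F^{-1}(1-\kappa)|^q}{\int_0^{\eta} x^q\, \bar p(x)\,dx}.$$

Since $\mathbb{E}|X|^q = \int_0^{\infty} x^q\, \bar p(x)\,dx = \infty$, the infimum over $\eta$ of
the right hand side is zero. ■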
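The counting variable $L(N, s_N)$ of Theorem 6, which drives the proof above, can be sketched in a few lines of Python. The snippet below is only an illustration: the function names are hypothetical, and the Uniform(0,1) example (for which $F_Y(y) = y$ and $\int_0^\tau y\,dF_Y(y) = \tau^2/2$) is an assumption chosen because the limit in (45) is then simply $\tau$.

```python
import random


def count_within_budget(y, s):
    """L(N, s): largest l such that the l smallest entries of y sum to at most s.

    Returns 0 when even the smallest entry exceeds s. Entries are assumed
    to be non-negative, as in Theorem 6.
    """
    running_sum = 0.0
    count = 0
    for value in sorted(y):
        running_sum += value
        if running_sum > s:
            break
        count += 1
    return count


def empirical_fraction(n, tau, seed=0):
    """Draw n Uniform(0,1) samples, set s_N = N * sigma, and return L_N / N."""
    rng = random.Random(seed)
    y = [rng.random() for _ in range(n)]
    sigma = tau * tau / 2.0  # integral of y dF_Y(y) over [0, tau] for Uniform(0,1)
    return count_within_budget(y, n * sigma) / n
```

For large `n`, `empirical_fraction(n, tau)` should be close to $F_Y(\tau) = \tau$, in line with (45).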

Remark 3. Further characterizing the typical asymptotic
behaviour of the relative error when $\mathbb{E}_p(|X|^q) = \infty$ and
$k_N/N \to 0$ appears to require a more detailed characterization
of the probability density function, such as decay bounds
on the tails of the distribution.

B. Proof of Theorem 2

The proof is based upon the following version of [15,
Theorem 5.1]:

Theorem 7 (DeVore et al. [15]). Let $\Phi(\omega) \in \mathbb{R}^{m \times N}$ be
a random matrix whose entries are iid and drawn from
$\mathcal{N}(0, 1/m)$. There are some absolute constants $C_0, \ldots, C_6$,
and $C_7$ depending on $C_0, \ldots, C_6$ such that, given any
$k \le C_0 m / \log(N/m)$ then



$$\|x - \Delta_1(\Phi(\omega)x)\|_2 \le C_7\, \sigma_k(x)_2, \tag{51}$$

with probability exceeding



1 - Cie 



'C2m 



In this version of the theorem we have specialized to the
case where the random matrices are Gaussian distributed.
We have also removed the rather peculiar requirement in the
original version that $N \ge [\ln 6]^2 m$, as careful scrutiny of the
proofs (in particular the proof of Theorem 3.5 [15]) indicates
that the effect of this term can be absorbed into the constant $C_3$
as long as ni/N < [j^]^ ~ 1.2, which is trivially satisfied. 



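We now proceed to prove Theorem 2. By assumption the
undersampling ratio $\delta = \lim_{N\to\infty} m_N/N > 0$, therefore there
exists a $0 < \kappa < 1$ such that

$$\delta > C_0\, \kappa \log\frac{1}{\delta}.$$

Now choosing a sequence $k_N/N \to \kappa$ we have, for large
enough $N$,

$$m_N > C_0\, k_N \log(N/m_N).$$

Hence, applying Theorem 7 for all $N$ large enough, there
exists a set $\Omega_N(x_N, k_N)$ with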

¥{n%{KN, kN)) < Cgme-^^v^ (52) 

such that (51) holds for all $\Phi_N(\omega) \in \Omega_N(x_N, k_N)$, i.e.,

$$\frac{\|x_N - \Delta_1(\Phi_N(\omega)x_N)\|_2}{\|x_N\|_2} \le C_7\, \frac{\sigma_{k_N}(x_N)_2}{\|x_N\|_2}. \tag{53}$$

A union bound argument similar to the one used in the proof
of Theorem 4 (see Appendix D) gives:

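$$\limsup_{N\to\infty} \frac{\|x_N - \Delta_1(\Phi_N x_N)\|_2}{\|x_N\|_2} \overset{a.s.}{\le} \limsup_{N\to\infty} C_7\, \frac{\sigma_{k_N}(x_N)_2}{\|x_N\|_2} \overset{a.s.}{=} C_7 \sqrt{G_2[p](\kappa)} = 0. \tag{54}$$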



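C. Proof of Theorem 3

We will need concentration bounds for several distributions.
For the Chi-square distribution with $n$ degrees of freedom
$\chi_n^2$, we will use the following standard result (see, e.g., [3,
Proposition 2.2], and the intermediate estimates in the proof
of [3, Corollary 2.3]):

Proposition 3. Let $X \in \mathbb{R}^n$ be a standard Gaussian random
vector. Then, for any $0 < \epsilon < 1$

$$\mathbb{P}\left(\|X\|_2^2 \ge n(1-\epsilon)^{-1}\right) \le e^{-n \cdot c_u(\epsilon)/2} \tag{55}$$

$$\mathbb{P}\left(\|X\|_2^2 \le n(1-\epsilon)\right) \le e^{-n \cdot c_l(\epsilon)/2} \tag{56}$$

with

$$c_u(\epsilon) := \frac{\epsilon}{1-\epsilon} + \ln(1-\epsilon) \tag{57}$$

$$c_l(\epsilon) := -\ln(1-\epsilon) - \epsilon. \tag{58}$$

Note that

$$\epsilon^2/2 \le c_l(\epsilon) \le c_u(\epsilon), \quad 0 < \epsilon < 1. \tag{59}$$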
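The exponents in Proposition 3 are easy to evaluate numerically. The short Python sketch below, with hypothetical function names, computes $c_u(\epsilon)$ and $c_l(\epsilon)$ from (57) and (58), checks the ordering (59) on a grid, and returns the two tail bounds (55) and (56) for a given dimension $n$.

```python
import math


def c_u(eps):
    """Upper-tail exponent (57): eps / (1 - eps) + ln(1 - eps)."""
    return eps / (1.0 - eps) + math.log(1.0 - eps)


def c_l(eps):
    """Lower-tail exponent (58): -ln(1 - eps) - eps."""
    return -math.log(1.0 - eps) - eps


def chi_square_tail_bounds(n, eps):
    """Return the right-hand sides of (55) and (56) for n degrees of freedom."""
    upper_tail = math.exp(-n * c_u(eps) / 2.0)
    lower_tail = math.exp(-n * c_l(eps) / 2.0)
    return upper_tail, lower_tail


def ordering_holds_on_grid(steps=99):
    """Check eps^2/2 <= c_l(eps) <= c_u(eps) for eps = 1/100, ..., 99/100."""
    for i in range(1, steps + 1):
        eps = i / 100.0
        if not (eps * eps / 2.0 <= c_l(eps) <= c_u(eps)):
            return False
    return True
```

Since $c_l(\epsilon) \le c_u(\epsilon)$, the lower-tail bound (56) is always the weaker of the two, which is why the later estimates are expressed in terms of $c_l$ only.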



Its corollary, which provides concentration for projections of
random variables from the unit sphere, will also be useful.
The statement is obtained by adjusting [3, Lemma 3.2] and [3,
Corollary 3.4] keeping the sharper estimate from above.

Corollary 1. Let $X$ be a random vector uniformly distributed
on the unit sphere in $\mathbb{R}^n$, and let $X_L$ be its orthogonal
projection on a $k$-dimensional subspace $L$ (alternatively, let
$X$ be an arbitrary random vector and $L$ be a random $k$-dimensional
subspace uniformly distributed on the Grassmannian
manifold). For any $0 < \epsilon < 1$ we have



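$$\mathbb{P}\left(\sqrt{\tfrac{n}{k}}\, \|X_L\|_2 \ge \|X\|_2 (1-\epsilon)^{-1}\right) \le e^{-k \cdot c_u(\epsilon)/2} + e^{-n \cdot c_l(\epsilon)/2} \tag{60}$$

$$\mathbb{P}\left(\sqrt{\tfrac{n}{k}}\, \|X_L\|_2 \le \|X\|_2 (1-\epsilon)\right) \le e^{-k \cdot c_l(\epsilon)/2} + e^{-n \cdot c_u(\epsilon)/2} \tag{61}$$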

The above result directly implies the concentration inequality
(27) for the LS estimator mentioned in Section IV. We
will also need a result about Wishart matrices. The Wishart
distribution [25] $\mathcal{W}_\ell(n, \Sigma)$ is the distribution of $\ell \times \ell$ matrices
$A = Z^T Z$ where $Z$ is an $n \times \ell$ matrix whose rows are independent
and have the normal distribution $\mathcal{N}(0, \Sigma)$.

Theorem 8 ([25, Theorem 3.2.12 and consequence, p.
97-98]). If $A$ is $\mathcal{W}_\ell(n, \Sigma)$ where $n - \ell + 1 > 0$, and if $Z \in \mathbb{R}^\ell$
is a random vector distributed independently of $A$ and with
$\mathbb{P}(Z = 0) = 0$, then the ratio $Z^T \Sigma^{-1} Z / Z^T A^{-1} Z$ follows
a Chi-square distribution with $n - \ell + 1$ degrees of freedom
$\chi^2_{n-\ell+1}$, and is independent of $Z$. Moreover, if $n - \ell - 1 > 0$
then

$$\mathbb{E} A^{-1} = \Sigma^{-1} \cdot (n - \ell - 1)^{-1}. \tag{62}$$

Finally, for convenience we formalize below some useful
but simple facts that we let the reader check.

Lemma 2. Let $A$ and $B$ be two independent $m \times k$ and
$m \times \ell$ random Gaussian matrices with iid entries $\mathcal{N}(0, 1/m)$,
and let $x \in \mathbb{R}^\ell$ be a random vector independent from $B$.
Consider a singular value decomposition (SVD) $A = U \Sigma V$
and let $u_i$ be the columns of $U$. Define $w := Bx/\|Bx\|_2 \in \mathbb{R}^m$,
$w_1 := (\langle u_i, w \rangle)_{i=1}^k \in \mathbb{R}^k$, $w_2 := w_1/\|w_1\|_2 \in \mathbb{R}^k$
and $w_3 := V^T w_2 \in \mathbb{R}^k$. We have

1) $w$ is uniformly distributed on the sphere in $\mathbb{R}^m$, and
statistically independent from $A$;

2) the distribution of $w_1$ is rotationally invariant in $\mathbb{R}^k$,
and it is statistically independent from $A$;

3) $w_2$ is uniformly distributed on the sphere in $\mathbb{R}^k$, and
statistically independent from $A$;

4) $w_3$ is uniformly distributed on the sphere in $\mathbb{R}^k$, and
statistically independent from $A$.

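We can now start the proof of Theorem 3. For any index
set $J$, we denote $x_J$ the vector which is zero out of $J$.
For matrices, the notation $\Phi_J$ indicates the sub-matrix of
$\Phi$ made of the columns indexed by $J$. The notation $\bar J$
stands for the complement of the set $J$. For any index set
$\Lambda$ associated to linearly independent columns of $\Phi_\Lambda$ we can
write $y = \Phi_\Lambda x_\Lambda + \Phi_{\bar\Lambda} x_{\bar\Lambda}$ hence

$$\Delta_{\mathrm{oracle}}(y, \Lambda) := \Phi_\Lambda^+ y = x_\Lambda + \Phi_\Lambda^+ \Phi_{\bar\Lambda} x_{\bar\Lambda}$$

$$\|\Delta_{\mathrm{oracle}}(y, \Lambda) - x\|_2^2 = \|\Phi_\Lambda^+ \Phi_{\bar\Lambda} x_{\bar\Lambda}\|_2^2 + \|x_{\bar\Lambda}\|_2^2. \tag{63}$$

The last equality comes from the fact that the restriction of
$(\Delta_{\mathrm{oracle}}(y, \Lambda) - x)$ to the indices in $\Lambda$ is $\Phi_\Lambda^+ \Phi_{\bar\Lambda} x_{\bar\Lambda}$, while its
restriction to $\bar\Lambda$ is $-x_{\bar\Lambda}$. Denoting

$$w := \frac{\Phi_{\bar\Lambda} x_{\bar\Lambda}}{\|\Phi_{\bar\Lambda} x_{\bar\Lambda}\|_2} \tag{64}$$

we obtain the relation

$$\frac{\|\Delta_{\mathrm{oracle}}(y, \Lambda) - x\|_2^2}{\|x_{\bar\Lambda}\|_2^2} = \|\Phi_\Lambda^+ w\|_2^2 \times \frac{\|\Phi_{\bar\Lambda} x_{\bar\Lambda}\|_2^2}{\|x_{\bar\Lambda}\|_2^2} + 1. \tag{65}$$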



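From the singular value decomposition

$$\Phi_\Lambda = U_m \begin{bmatrix} \Sigma_k \\ 0_{(m-k) \times k} \end{bmatrix} V_k$$

where $U_m$ is an $m \times m$ unitary matrix with columns $u_\ell$,
and $V_k$ is a $k \times k$ unitary matrix, we deduce that
$\Phi_\Lambda^+ = V_k^T [\Sigma_k^{-1}\ 0_{k \times (m-k)}]\, U_m^T$, and

$$\|\Phi_\Lambda^+ w\|_2^2 = \|[\Sigma_k^{-1}\ 0_{k \times (m-k)}]\, U_m^T w\|_2^2 = \sum_{\ell=1}^{k} \sigma_\ell^{-2}\, |\langle u_\ell, w \rangle|^2. \tag{66}$$

Since $\Phi_{\bar\Lambda}$ and $x_{\bar\Lambda}$ are statistically independent, the random
vector $\Phi_{\bar\Lambda} x_{\bar\Lambda} \in \mathbb{R}^m$ is Gaussian with zero mean and covariance
$m^{-1} \cdot \|x_{\bar\Lambda}\|_2^2 \cdot \mathrm{Id}_m$. Therefore,

$$\mathbb{E}\left\{\|\Phi_{\bar\Lambda} x_{\bar\Lambda}\|_2^2 / \|x_{\bar\Lambda}\|_2^2\right\} = 1 \tag{67}$$

and by Proposition 3, for any $0 < \epsilon_0 < 1$

$$\mathbb{P}\left(1 - \epsilon_0 \le \frac{\|\Phi_{\bar\Lambda} x_{\bar\Lambda}\|_2^2}{\|x_{\bar\Lambda}\|_2^2} \le (1 - \epsilon_0)^{-1}\right) \ge 1 - 2 \cdot e^{-m \cdot c_l(\epsilon_0)/2}. \tag{68}$$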

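Moreover, by Lemma 2, item 2, the random variables $\langle u_\ell, w \rangle$,
$1 \le \ell \le k$, are identically distributed and independent from
the random singular values $\sigma_\ell$. Therefore,

$$\mathbb{E}\|\Phi_\Lambda^+ w\|_2^2 = \mathbb{E}\left\{\sum_{\ell=1}^{k} \sigma_\ell^{-2}\right\} \times \mathbb{E}\left\{|\langle u, w \rangle|^2\right\} = \mathbb{E}\left\{\mathrm{Trace}\,(\Phi_\Lambda^T \Phi_\Lambda)^{-1}\right\} \times \frac{1}{m}.$$

The matrix $\Phi_\Lambda^T \Phi_\Lambda$ is $\mathcal{W}_k(m, \frac{1}{m}\mathrm{Id}_k)$ hence, by Theorem 8,
when $m - k - 1 > 0$ we have

$$\mathbb{E}\|\Phi_\Lambda^+ w\|_2^2 = \frac{\mathrm{Trace}\,(m\, \mathrm{Id}_k)}{(m - k - 1)\, m} = \frac{k}{m - k - 1}. \tag{69}$$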



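Now, considering $w_1 := (\langle u_\ell, w \rangle)_{\ell=1}^k \in \mathbb{R}^k$, $w_2 := w_1/\|w_1\|_2$
and $w_3 := V_k^T w_2$, we obtain

$$\|\Phi_\Lambda^+ w\|_2^2 = \|\Sigma_k^{-1} w_1\|_2^2 = \|w_1\|_2^2 \times \|\Sigma_k^{-1} w_2\|_2^2 = \|w_1\|_2^2 \times \|\Sigma_k^{-1} V_k w_3\|_2^2 = \|w_1\|_2^2 \times w_3^T (\Phi_\Lambda^T \Phi_\Lambda)^{-1} w_3 = m \|w_1\|_2^2 / R(w_3),$$

where $R(w_3) := w_3^T (m^{-1} \mathrm{Id}_k)^{-1} w_3 / w_3^T (\Phi_\Lambda^T \Phi_\Lambda)^{-1} w_3$.
By Lemma 2, item 4, $w_3$ is statistically independent from $\Phi_\Lambda$. As a result, by
Theorem 8, the random variable $R(w_3)$ follows a Chi-square
distribution with $m - k + 1$ degrees of freedom $\chi^2_{m-k+1}$, and
by Proposition 3, for any $0 < \epsilon_1 < 1$,

$$\mathbb{P}\left(1 - \epsilon_1 \le R(w_3)^{-1} \cdot (m - k + 1) \le (1 - \epsilon_1)^{-1}\right) \ge 1 - 2 e^{-(m-k+1) \cdot c_l(\epsilon_1)/2}. \tag{70}$$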
Moreover, since $w_1$ is a random $k$-dimensional orthogonal
projection of the unit vector $w$, by Corollary 1, for any
$0 < \epsilon_2 < 1$

$$\mathbb{P}\left(1 - \epsilon_2 \le m \|w_1\|_2^2 / k \le (1 - \epsilon_2)^{-1}\right) \ge 1 - 4 e^{-k \cdot c_l(\epsilon_2)/2}. \tag{71}$$



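To conclude, since $\Phi_{\bar\Lambda} x_{\bar\Lambda}$ is Gaussian, its $\ell^2$-norm $\|\Phi_{\bar\Lambda} x_{\bar\Lambda}\|_2$
and direction $w$ are mutually independent, hence $\|\Phi_\Lambda^+ w\|_2^2$ and
$\|\Phi_{\bar\Lambda} x_{\bar\Lambda}\|_2^2 / \|x_{\bar\Lambda}\|_2^2$ are also mutually independent. Therefore, we can
combine the decomposition (65) with the expected values (67)
and (69) to obtain

$$\frac{\mathbb{E}\|\Delta_{\mathrm{oracle}}(y, \Lambda) - x\|_2^2}{\|x_{\bar\Lambda}\|_2^2} = \mathbb{E}\|\Phi_\Lambda^+ w\|_2^2 \cdot \frac{\mathbb{E}\|\Phi_{\bar\Lambda} x_{\bar\Lambda}\|_2^2}{\|x_{\bar\Lambda}\|_2^2} + 1 = \frac{k}{m - k - 1} + 1 = \frac{1}{1 - \frac{k}{m-1}}.$$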
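The last identity, $1 + k/(m-k-1) = 1/(1 - k/(m-1))$, is elementary but easy to mistype. The minimal Python sketch below (the function name is hypothetical) evaluates the expected error ratio from (65), (67) and (69) in exact rational arithmetic, so that the two forms can be compared without rounding error.

```python
from fractions import Fraction


def expected_oracle_error_ratio(k, m):
    """Expected ratio E||Delta_oracle - x||^2 / ||x_{complement}||^2 = 1 + k/(m-k-1).

    Requires m - k - 1 > 0, the condition under which (69) holds.
    """
    if k < 0 or m - k - 1 <= 0:
        raise ValueError("requires k >= 0 and m - k - 1 > 0")
    return 1 + Fraction(k, m - k - 1)


def closed_form(k, m):
    """Equivalent form 1 / (1 - k/(m-1)) of the same ratio."""
    return 1 / (1 - Fraction(k, m - 1))
```

For instance, with $k = 10$ and $m = 40$ both functions return $39/29 \approx 1.34$: the oracle estimator pays roughly a third more than the energy outside the support $\Lambda$.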



We conclude that: for any index set A of size at most fc, with 
fc < m — 1, in expectation 



E 1 1 ^oracle 

(y,A)-x||^ _ E||Ao,-ade(y, A) - x| 



^All2 



k 

m— 1 



> 



1 



k 

m— 1 



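In terms of concentration, combining (68), (70), and (71),
we get that for $0 < \epsilon_0, \epsilon_1, \epsilon_2 < 1$:

$$(1-\epsilon_0)(1-\epsilon_1)(1-\epsilon_2) \le \|\Phi_\Lambda^+ w\|_2^2 \times \frac{\|\Phi_{\bar\Lambda} x_{\bar\Lambda}\|_2^2}{\|x_{\bar\Lambda}\|_2^2} \times \frac{m - k + 1}{k} \le [(1-\epsilon_0)(1-\epsilon_1)(1-\epsilon_2)]^{-1}$$

except with probability at most (setting $\epsilon_i = \epsilon$, $i = 0, 1, 2$)

$$2 \cdot e^{-m \cdot c_l(\epsilon_0)/2} + 4 \cdot e^{-k \cdot c_l(\epsilon_2)/2} + 2 \cdot e^{-(m-k+1) \cdot c_l(\epsilon_1)/2} \le 8 \cdot e^{-\min(k,\, m-k+1) \cdot c_l(\epsilon)/2}.$$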

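D. Proof of Theorem 4

Remember that we are considering sequences
$k_N, m_N, \Phi_N, \Lambda_N, x_N$. Denoting $\rho_N = k_N/m_N$ and
$\delta_N = m_N/N$, we observe that the probability (29)
can be expressed as $1 - 8 e^{-N \cdot c_N(\epsilon)/2}$ where
$c_N(\epsilon) = c_l(\epsilon) \cdot \delta_N \cdot \min(\rho_N, 1 - \rho_N)$. For any choice
of $\epsilon$, we have

$$\lim_{N\to\infty} c_N(\epsilon) = c_l(\epsilon) \cdot \delta \cdot \min(\rho, 1 - \rho) > 0,$$

hence $\sum_N e^{-N \cdot c_N(\epsilon)/2} < \infty$ and we obtain that for any $\eta > 0$

$$\sum_N \mathbb{P}\left(\left|\left(\frac{\|\Delta_{\mathrm{oracle}}(y_N, \Lambda_N) - x_N\|_2^2}{\|x_{\bar\Lambda_N}\|_2^2} - 1\right) \times \left(\frac{m_N}{k_N} - 1\right) - 1\right| \ge \eta\right) < \infty.$$

This implies (by the cited Corollary 4.6.1) the almost sure convergence

$$\lim_{N\to\infty} \left(\frac{\|\Delta_{\mathrm{oracle}}(y_N, \Lambda_N) - x_N\|_2^2}{\|x_{\bar\Lambda_N}\|_2^2} - 1\right) \times \left(\frac{m_N}{k_N} - 1\right) \overset{a.s.}{=} 1.$$

Finally, since $k_N/m_N = \rho_N \to \rho$ and $\delta_N \to \delta$, we also have

$$\lim_{N\to\infty} \frac{k_N}{m_N - k_N} = \frac{\rho}{1 - \rho}$$

and we conclude that

$$\lim_{N\to\infty} \frac{\|\Delta_{\mathrm{oracle}}(y_N, \Lambda_N) - x_N\|_2^2}{\|x_{\bar\Lambda_N}\|_2^2} \overset{a.s.}{=} 1 + \frac{\rho}{1 - \rho} = \frac{1}{1 - \rho}.$$

We obtain the result for the least squares decoder by copying
the above arguments and starting from (27).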

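E. Proof of Lemma 1

For the first result we assume that $G_2[p](\delta^2) \le (1 - \delta)^2$. We
take $\rho = \delta$ and obtain by definition

$$H[p](\delta) \le \frac{G_2[p](\delta\rho)}{1 - \rho} = \frac{G_2[p](\delta^2)}{1 - \delta} \le 1 - \delta.$$

The second result is a straightforward consequence of the first
one. For the last one, we consider $\delta \in (0, \delta_0)$. For any $\rho \in (0, 1)$
we set $\kappa := \delta\rho \in (0, \delta_0)$. Since for any pair $a, b \in (0, 1)$
we have $(1 - a)(1 - b) \le (1 - \sqrt{ab})^2$, we have

$$G_2[p](\kappa) \ge (1 - \sqrt{\delta\rho})^2 \ge (1 - \rho)(1 - \delta)$$

and we conclude that

$$\forall \rho \in (0, 1), \quad \frac{G_2[p](\delta\rho)}{1 - \rho} \ge 1 - \delta.$$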
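The elementary inequality $(1-a)(1-b) \le (1-\sqrt{ab})^2$ used in this proof is not derived in the text. A short derivation, given here as a sketch based on the arithmetic-geometric mean inequality, is:

```latex
\begin{align*}
(1-a)(1-b) &= 1 - (a + b) + ab \\
           &\le 1 - 2\sqrt{ab} + ab
             && \text{since } a + b \ge 2\sqrt{ab} \text{ for } a, b \ge 0 \\
           &= \bigl(1 - \sqrt{ab}\bigr)^2 .
\end{align*}
```

Equality holds exactly when $a = b$, which is why the choice $\rho = \delta$ is the natural one in the first part of the proof.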



F. Proof of Theorem 1 and Theorem 5

Theorem 1 and Theorem 5 can be proved from Theorem 2
and Lemma 1 along with the following result.

Lemma 3. Let $p(x)$ be the PDF of a distribution with finite
fourth moment $\mathbb{E}X^4 < \infty$. Then there exists some $\delta_0 \in (0, 1)$
such that the function $G_2[p](\kappa)$ as defined in Proposition 1
satisfies

$$G_2[p](\kappa) \ge (1 - \sqrt{\kappa})^2, \quad \forall \kappa \in (0, \delta_0). \tag{72}$$

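Proof of Lemma 3: Without loss of generality we can
assume that $p(x)$ has unit second moment, hence

$$G_2[p](\kappa) := \frac{\int_0^{\alpha} u^2\, \bar p(u)\,du}{\int_0^{\infty} u^2\, \bar p(u)\,du} = 1 - \int_\alpha^{\infty} u^2\, \bar p(u)\,du$$

where we denote $\alpha = \bar F^{-1}(1 - \kappa)$, which is equivalent to
$\kappa = 1 - \bar F(\alpha) = \int_\alpha^{\infty} \bar p(u)\,du$. The inequality $G_2[p](\kappa) \ge (1 - \sqrt{\kappa})^2$
is equivalent to $2\sqrt{\kappa} \ge 1 + \kappa - G_2[p](\kappa)$, that is to say

$$2 \sqrt{\int_\alpha^{\infty} \bar p(u)\,du} \ge \int_\alpha^{\infty} (u^2 + 1)\, \bar p(u)\,du. \tag{73}$$

By the Cauchy-Schwarz inequality

$$\int_\alpha^{\infty} (u^2 + 1)\, \bar p(u)\,du \le \sqrt{\int_\alpha^{\infty} (u^2 + 1)^2\, \bar p(u)\,du} \cdot \sqrt{\int_\alpha^{\infty} \bar p(u)\,du}.$$

Since $\mathbb{E}X^4 < \infty$, for all small enough $\kappa$ (i.e., large enough $\alpha$)
the right hand side is arbitrarily smaller than $2\sqrt{\int_\alpha^{\infty} \bar p(u)\,du}$,
hence the inequality $G_2[p](\kappa) \ge (1 - \sqrt{\kappa})^2$ holds true. ■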
Proof of Theorem 1 and Theorem 5: Theorem 1 and
Theorem 5 now follow by combining Lemma 3 and Lemma 1
to show that for a distribution with finite fourth moment there
exists a $\delta_0 \in (0, 1)$ such that $H[p](\delta) \ge 1 - \delta$ for all
$\delta \in (0, \delta_0)$. The asymptotic almost sure comparative performance
of the estimators then follows from the concentration bounds
in Theorem 3 and for the least squares estimator. ■

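G. Proof of Proposition 2

Just as in the proof of Lemma 3 above, we denote
$\alpha = \bar F^{-1}(1 - \kappa)$, which is equivalent to $\kappa = 1 - \bar F(\alpha) = \int_\alpha^{\infty} \bar p(u)\,du$.
We know from Lemma 1 that the identity
$H[p](\rho) = 1 - \rho$ for all $0 < \rho < 1$ is equivalent to $G_2[p](\kappa) = (1 - \sqrt{\kappa})^2$
for all $0 < \kappa < 1$. By the same computations
as in the proof of Lemma 3, under the unit second moment
constraint $\mathbb{E}_p(X^2) = 1$, the latter is equivalent to

$$2 \sqrt{\int_\alpha^{\infty} \bar p(u)\,du} = \int_\alpha^{\infty} (u^2 + 1)\, \bar p(u)\,du. \tag{74}$$

Denote $K(\alpha) := \int_\alpha^{\infty} (u^2 + 1)\, \bar p(u)\,du$. The constraint is
$K(\alpha) \cdot K(\alpha) = 4 \int_\alpha^{\infty} \bar p(u)\,du$. Taking the derivative and negating we
must have $2K(\alpha) \cdot [(\alpha^2 + 1) \cdot \bar p(\alpha)] = 4 \bar p(\alpha)$. If $\bar p(\alpha) \ne 0$
it follows that $K(\alpha) = 2/(\alpha^2 + 1)$ hence $(\alpha^2 + 1) \cdot \bar p(\alpha) = -K'(\alpha) = 4\alpha/(\alpha^2 + 1)^2$,
that is to say $\bar p(\alpha) = 4\alpha/(\alpha^2 + 1)^3$,
which is satisfied for $p(x) = p_0(x)$. One can check that

$$\int_0^{\infty} \frac{4\alpha}{(\alpha^2 + 1)^3}\,d\alpha = \left[-\frac{1}{(\alpha^2 + 1)^2}\right]_0^{\infty} = 1$$

and, since $\bar p_0(\alpha) \asymp 4\alpha^{-5}$, $\mathbb{E}_{p_0}(X^4) = \infty$.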
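The computation above can be checked numerically. The Python sketch below is an illustration only: it takes the density of $|X|$ under $p_0$ to be $4u/(u^2+1)^3$ as derived above, uses a hand-written composite Simpson rule (a choice made here to stay within the standard library), and compares the numerically integrated $G_2[p_0](\kappa)$ with $(1-\sqrt{\kappa})^2$. The function names are hypothetical.

```python
import math


def p0_abs(u):
    """Density of |X| under p_0, for u >= 0: 4u / (u^2 + 1)^3."""
    return 4.0 * u / (u * u + 1.0) ** 3


def simpson(f, a, b, n=20000):
    """Composite Simpson rule on [a, b] with n (even) sub-intervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def g2_p0(kappa):
    """G_2[p_0](kappa): second moment of |X| restricted to [0, alpha].

    alpha solves kappa = 1 / (alpha^2 + 1)^2, the tail mass of p_0 beyond
    alpha; no normalization is needed because p_0 has unit second moment.
    """
    alpha = math.sqrt(1.0 / math.sqrt(kappa) - 1.0)
    return simpson(lambda u: u * u * p0_abs(u), 0.0, alpha)
```

For example, `g2_p0(0.25)` should agree with $(1 - \sqrt{0.25})^2 = 0.25$ up to integration error, which illustrates that $p_0$ sits exactly on the boundary case of Lemma 1.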

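H. Proof of the statements in Example 4

Without loss of generality we rescale $p_{\tau,s}(x)$ in the form
$p(x) = (1/a) \cdot p_{\tau,s}(x/a)$ so that $p$ is a proper PDF with
unit variance $\mathbb{E}X^2 = 1$. Observing that $p_{\tau,s}(x) \asymp_{x\to\infty} x^{-s}$,
we have: $\mathbb{E}X^2 < \infty$ if, and only if, $s > 3$; $\mathbb{E}X^4 < \infty$ if, and
only if, $s > 5$. For large $\alpha$, $n = 0, 2$, $3 < s < 5$, we obtain

$$\int_\alpha^{\infty} x^n\, \bar p(x)\,dx \asymp \int_\alpha^{\infty} x^{n-s}\,dx = \left[\frac{x^{n+1-s}}{n+1-s}\right]_\alpha^{\infty} = \frac{\alpha^{n+1-s}}{s-n-1}$$

hence, from the relation between $\kappa$ and $\alpha$, we obtain

$$\frac{1 + \kappa - G_2[p](\kappa)}{2\sqrt{\kappa}} = \frac{\int_\alpha^{\infty} (u^2 + 1)\, \bar p(u)\,du}{2\sqrt{\int_\alpha^{\infty} \bar p(u)\,du}} \asymp \frac{\alpha^{3-s} + \alpha^{1-s}}{\alpha^{\frac{1-s}{2}}} \asymp \alpha^{\frac{5-s}{2}}.$$

For $3 < s < 5$ we get

$$\lim_{\kappa \to 0} \frac{1 + \kappa - G_2[p](\kappa)}{2\sqrt{\kappa}} = \infty$$

hence there exists $\delta_0 > 0$ such that for $\kappa \le \delta_0^2$

$$G_2[p](\kappa) \le 1 + \kappa - 2\sqrt{\kappa} = (1 - \sqrt{\kappa})^2.$$

We conclude using Lemma 1.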



I. The Laplace distribution

First we compute $\bar p_1(x) = \exp(-x)$ for $x \ge 0$, $\bar F_1(z) = 1 - e^{-z}$,
$z \ge 0$, hence $\bar F_1^{-1}(1 - \kappa) = -\ln \kappa$. For all integers $q \ge 1$
and $x > 0$, we obtain by integration by parts the recurrence
relation

$$\int_0^{x} u^q e^{-u}\,du = q \int_0^{x} u^{q-1} e^{-u}\,du - x^q e^{-x}, \quad \forall q \ge 1.$$

We have $\int_0^{x} e^{-u}\,du = 1 - e^{-x}$, hence for $q = 1$ we obtain
$\int_0^{x} u e^{-u}\,du = 1 - e^{-x} - x e^{-x} = 1 - (1 + x) e^{-x}$, and for $q = 2$ it is easy
to compute

$$\int_0^{x} u^2 e^{-u}\,du = 2 - (2 + 2x + x^2) e^{-x}.$$

([Tol l and ^ follow from substituting these expressions into: 

$$G_q[p_1](\kappa) = \frac{\int_0^{-\ln \kappa} u^q\, \bar p_1(u)\,du}{\int_0^{\infty} u^q\, \bar p_1(u)\,du}.$$
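Carrying out the substitution $x = -\ln \kappa$ (so that $e^{-x} = \kappa$) in the two integrals computed above, and dividing by $\int_0^\infty u^q e^{-u}\,du = q!$, gives closed forms for $q = 1$ and $q = 2$. The Python sketch below evaluates them and compares the $q = 2$ case with a Monte Carlo estimate of the relative best $k$-term error on iid exponential magnitudes. The function names, sample size and seed are assumptions made for the illustration.

```python
import math
import random


def g1_laplace(kappa):
    """G_1[p_1](kappa) = 1 - (1 + x) * kappa with x = -ln(kappa)."""
    x = -math.log(kappa)
    return 1.0 - (1.0 + x) * kappa


def g2_laplace(kappa):
    """G_2[p_1](kappa) = (2 - (2 + 2x + x^2) * kappa) / 2 with x = -ln(kappa)."""
    x = -math.log(kappa)
    return (2.0 - (2.0 + 2.0 * x + x * x) * kappa) / 2.0


def empirical_g(q, kappa, n=100000, seed=0):
    """Relative best k-term error sigma_k(x)_q^q / ||x||_q^q with k = round(kappa*n).

    The magnitudes of iid Laplace entries are iid Exponential(1) variables.
    """
    rng = random.Random(seed)
    powers = sorted((rng.expovariate(1.0) ** q for _ in range(n)), reverse=True)
    k = int(round(kappa * n))
    return sum(powers[k:]) / sum(powers)
```

For instance, `g2_laplace(0.2)` is about 0.22: keeping the largest 20% of the entries of a Laplace-distributed vector still leaves roughly a fifth of its energy in the discarded tail, which is the sense in which the Laplace distribution is not compressible.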